Hadoop Map/Reduce - MAPREDUCE-2094

LineRecordReader should not seek into non-splittable, compressed streams.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0
    • Component/s: task
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      While implementing a custom derivative of FileInputFormat, we ran into the problem that a large gzipped input file would be processed several times.

      A file of nearly 1 GiB would be processed around 36 times in its entirety, producing garbage results and consuming far more CPU time than needed.

      It took a while to figure out, and what we found is that the default implementation of the isSplitable method in org.apache.hadoop.mapreduce.lib.input.FileInputFormat is simply "return true;".

      This is a very unsafe default and contradicts the JavaDoc of the method, which states: "Is the given filename splitable? Usually, true, but if the file is stream compressed, it will not be." The actual implementation effectively behaves as: "Is the given filename splitable? Always true, even if the file is stream compressed using an unsplittable compression codec."

      For our situation (where the input is always gzipped) we took the easy way out and simply overrode isSplitable in our class to "return false;" (a minimal sketch of that workaround is shown below).
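      As an illustration only (the class below and its name are hypothetical, not the code we actually shipped), such a "never split" workaround could look roughly like this:

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.InputSplit;
      import org.apache.hadoop.mapreduce.JobContext;
      import org.apache.hadoop.mapreduce.RecordReader;
      import org.apache.hadoop.mapreduce.TaskAttemptContext;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

      public class GzipOnlyTextInputFormat extends FileInputFormat<LongWritable, Text> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
          // Never split: safe (but not parallel) when the input is always gzipped.
          return false;
        }

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
          // Reads the single, unsplit file line by line.
          return new LineRecordReader();
        }
      }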

      There are essentially three ways I can think of to fix this (in order of what I would find preferable):

      1. Implement something that looks at the compression codec used for the file (i.e. migrate the implementation from TextInputFormat up into FileInputFormat). This would make the method do what the JavaDoc describes (see the sketch after this list).
      2. "Force" developers to think about it by making this method abstract.
      3. Use a "safe" default (i.e. "return false;").
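      For reference, option 1 would roughly amount to moving the codec check that TextInputFormat already performs up into FileInputFormat. A hedged sketch (illustrative only, not the exact patch; the class name is hypothetical):

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.compress.CompressionCodec;
      import org.apache.hadoop.io.compress.CompressionCodecFactory;
      import org.apache.hadoop.io.compress.SplittableCompressionCodec;
      import org.apache.hadoop.mapreduce.JobContext;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

      public abstract class CodecAwareFileInputFormat<K, V> extends FileInputFormat<K, V> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
          CompressionCodec codec =
              new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
          // Uncompressed files can always be split; compressed files only if the
          // codec explicitly supports splitting (e.g. bzip2), unlike plain gzip.
          return codec == null || codec instanceof SplittableCompressionCodec;
        }
      }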
      Attachments

      1. MAPREDUCE-2094-FileInputFormat-docs-v2.patch
        4 kB
        Gian Merlino
      2. MAPREDUCE-2094-2015-05-05-2328.patch
        11 kB
        Niels Basjes
      3. MAPREDUCE-2094-20140727-svn-fixed-spaces.patch
        11 kB
        Niels Basjes
      4. MAPREDUCE-2094-20140727-svn.patch
        10 kB
        Niels Basjes
      5. MAPREDUCE-2094-20140727.patch
        11 kB
        Niels Basjes
      6. MAPREDUCE-2094-2011-05-19.patch
        6 kB
        Niels Basjes
      7. M2094-1.patch
        11 kB
        Chris Douglas
      8. M2094.patch
        10 kB
        Chris Douglas

          Activity

          Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2138 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2138/)
          MAPREDUCE-2094. LineRecordReader should not seek into non-splittable, compressed streams. (cdouglas: rev 2edcf931d7843cddcf3da5666a73d6ee9a10d00d)

          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapred/TestLineRecordReader.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestLineRecordReader.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/pom.xml
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/resources/TestSafeguardSplittingUnsplittableFiles.txt.gz
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/LineRecordReader.java
          Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #190 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/190/)
          MAPREDUCE-2094. LineRecordReader should not seek into non-splittable, compressed streams. (cdouglas: rev 2edcf931d7843cddcf3da5666a73d6ee9a10d00d)

          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestLineRecordReader.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/pom.xml
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/resources/TestSafeguardSplittingUnsplittableFiles.txt.gz
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapred/TestLineRecordReader.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/LineRecordReader.java
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #180 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/180/)
          MAPREDUCE-2094. LineRecordReader should not seek into non-splittable, compressed streams. (cdouglas: rev 2edcf931d7843cddcf3da5666a73d6ee9a10d00d)

          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/LineRecordReader.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/resources/TestSafeguardSplittingUnsplittableFiles.txt.gz
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapred/TestLineRecordReader.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/pom.xml
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestLineRecordReader.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk #2120 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2120/)
          MAPREDUCE-2094. LineRecordReader should not seek into non-splittable, compressed streams. (cdouglas: rev 2edcf931d7843cddcf3da5666a73d6ee9a10d00d)

          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/resources/TestSafeguardSplittingUnsplittableFiles.txt.gz
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapred/TestLineRecordReader.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestLineRecordReader.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/LineRecordReader.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/pom.xml
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java
          Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Yarn-trunk #922 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/922/)
          MAPREDUCE-2094. LineRecordReader should not seek into non-splittable, compressed streams. (cdouglas: rev 2edcf931d7843cddcf3da5666a73d6ee9a10d00d)

          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/resources/TestSafeguardSplittingUnsplittableFiles.txt.gz
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/pom.xml
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapred/TestLineRecordReader.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/LineRecordReader.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestLineRecordReader.java
          Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #191 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/191/)
          MAPREDUCE-2094. LineRecordReader should not seek into non-splittable, compressed streams. (cdouglas: rev 2edcf931d7843cddcf3da5666a73d6ee9a10d00d)

          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapred/TestLineRecordReader.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/LineRecordReader.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/resources/TestSafeguardSplittingUnsplittableFiles.txt.gz
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/pom.xml
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestLineRecordReader.java
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #7779 (See https://builds.apache.org/job/Hadoop-trunk-Commit/7779/)
          MAPREDUCE-2094. LineRecordReader should not seek into non-splittable, compressed streams. (cdouglas: rev 2edcf931d7843cddcf3da5666a73d6ee9a10d00d)

          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/resources/TestSafeguardSplittingUnsplittableFiles.txt.gz
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestLineRecordReader.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/LineRecordReader.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/pom.xml
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapred/TestLineRecordReader.java
          Chris Douglas added a comment -

          +1

          I committed this to trunk and branch-2. Thanks Niels

          Chris Douglas added a comment -

          Ran test-patch locally, all OK except a spurious whitespace and a release audit warning (fixed)

          Chris Douglas added a comment -

          Rewrote exception message.

          Niels Basjes, I know you think this undersells the severity of the bug. I'll rewrite the description to limit the scope of this fix, if you still want to litigate the point in another JIRA.

          Niels Basjes added a comment -

          P.S. I'm unable to provide an updated patch for the next week or so, so please make up a good message so that people can start catching this problem.

          Niels Basjes added a comment -

          I understand you want the error message to be 'clean'. Normally I would do that too.
          This message can, however, only appear if you are using (or have been using for a long time) a (usually custom) FileInputFormat that has been corrupting your results (for perhaps even years ... note that I created this bug report about 4.5 years ago).
          I think it is important to clarify the impact of the problem that the author of the custom code introduced, so they immediately understand what went wrong.
          And yes... this message is a bit over the top ...

          At least let's make the message clarify what went wrong and make people aware of the historical implications.
          How about "A split was attempted for a file that is being decompressed by " + codec.getClass().getSimpleName() + " which does not support splitting. Note that this would have corrupted the data in older Hadoop versions."

          Chris Douglas added a comment -

          Given discussion on the dev list, the following error message:

          +          throw new IOException(
          +                    "Implementation bug in the used FileInputFormat: " +
          +                    "The isSplitable method returned 'true' on a file that " +
          +                    "was compressed with a non splittable compression codec. " +
          +                    "If you get this right after upgrading Hadoop then know "+
          +                    "that you have been looking at reports based on " +
          +                    "corrupt data for a long time !!! (see: MAPREDUCE-2094)");
          

          is a little over the top. Please just report the error detected e.g., "Cannot seek in " + codec.getClass().getSimpleName() + " compressed stream"

          Allen Wittenauer added a comment -

          We're talking about the whitespace plugin being broken over in HADOOP-11923.

          Hadoop QA added a comment -



          -1 overall



          Vote Subsystem Runtime Comment
          0 pre-patch 14m 57s Pre-patch trunk compilation is healthy.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 tests included 0m 0s The patch appears to include 3 new or modified test files.
          +1 javac 7m 40s There were no new javac warning messages.
          +1 javadoc 9m 47s There were no new javadoc warning messages.
          +1 release audit 0m 23s The applied patch does not increase the total number of release audit warnings.
          +1 checkstyle 0m 53s There were no new checkstyle issues.
          -1 whitespace 0m 0s The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix.
          +1 install 1m 31s mvn install still works.
          +1 eclipse:eclipse 0m 34s The patch built with eclipse:eclipse.
          +1 findbugs 1m 16s The patch does not introduce any new Findbugs (version 2.0.3) warnings.
          +1 mapreduce tests 1m 38s Tests passed in hadoop-mapreduce-client-core.
              38m 46s  



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12730617/MAPREDUCE-2094-2015-05-05-2328.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / 0100b15
          whitespace https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5648/artifact/patchprocess/whitespace.txt
          hadoop-mapreduce-client-core test log https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5648/artifact/patchprocess/testrun_hadoop-mapreduce-client-core.txt
          Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5648/testReport/
          Java 1.7.0_55
          uname Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5648/console

          This message was automatically generated.

          Niels Basjes added a comment -

          This should work

          Hadoop QA added a comment -



          -1 overall



          Vote Subsystem Runtime Comment
          -1 patch 0m 0s The patch command could not apply the patch during dryrun.



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12730601/MAPREDUCE-2094-20140727-svn-fixed-spaces.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / 0100b15
          Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5647/console

          This message was automatically generated.

          Niels Basjes added a comment -

          Removed the trailing spaces from the lines I touched.

          @Allen Wittenauer: Apparently the trailing spaces check is also triggered by the trailing space in one of the 'surrounding' lines in the patch file. How should this be handled?

          Niels Basjes added a comment -

          I removed the trailing spaces from the lines I touched.
          One line that was reported however is a line I didn't touch in my patch.

          Question: What should I do about that?

          Hadoop QA added a comment -



          -1 overall



          Vote Subsystem Runtime Comment
          0 pre-patch 14m 31s Pre-patch trunk compilation is healthy.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 tests included 0m 0s The patch appears to include 3 new or modified test files.
          +1 javac 7m 27s There were no new javac warning messages.
          +1 javadoc 9m 37s There were no new javadoc warning messages.
          +1 release audit 0m 22s The applied patch does not increase the total number of release audit warnings.
          +1 checkstyle 1m 3s There were no new checkstyle issues.
          -1 whitespace 0m 1s The patch has 9 line(s) that end in whitespace. Use git apply --whitespace=fix.
          +1 install 1m 35s mvn install still works.
          +1 eclipse:eclipse 0m 32s The patch built with eclipse:eclipse.
          +1 findbugs 1m 15s The patch does not introduce any new Findbugs (version 2.0.3) warnings.
          +1 mapreduce tests 1m 34s Tests passed in hadoop-mapreduce-client-core.
              38m 0s  



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12658039/MAPREDUCE-2094-20140727-svn.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / 6ae2a0d
          whitespace https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5609/artifact/patchprocess/whitespace.txt
          hadoop-mapreduce-client-core test log https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5609/artifact/patchprocess/testrun_hadoop-mapreduce-client-core.txt
          Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5609/testReport/
          Java 1.7.0_55
          uname Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5609/console

          This message was automatically generated.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12658039/MAPREDUCE-2094-20140727-svn.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4772//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4772//console

          This message is automatically generated.

          Niels Basjes added a comment -

          This patch no longer contains the binary file.

          Niels Basjes added a comment -

          Created new patch file via the svn route.
          In this patch the binary file is now an ASCII file. This is possible because the content of the file is irrelevant; only the filename and the size (> 3 bytes) matter.

          Niels Basjes added a comment -

          Problem with binary file in the patch file

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12658034/MAPREDUCE-2094-20140727.patch
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4771//console

          This message is automatically generated.

          Niels Basjes added a comment -

          This "Fix" does not solve the problem in all scenarios as this is a rabbit hole proven to be too deep.

          This does however fix all scenarios where a new FileInputFormat is created that also uses the LineRecordReader (or a subclass) and forget to implement the correct isSplitable. This is exactly the case that happens if someone were to follow the example from the Yahoo Hadoop tutorial ( https://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat ) to create their own inputformat.

          This patch makes the job fail when it detects the corruption scenario.

          I haven't tried it yet but with this patch; if you feed a large Gzipped file into the software from the mentioned tutorial it should now fail the entire job hard the moment it creates a second split.

          NOTE 1: This patch includes all the JavaDoc improvements created by Gian Merlino.
          NOTE 2: To create a unit test I needed to include a small gzipped file in the patch. I hope I did that the right way.
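          For readers of this issue, here is a rough, hypothetical sketch of the kind of safeguard described above. The helper class, its names and the exact condition are placeholders rather than the committed change; only the exception message follows the wording suggested earlier in this thread.

          import java.io.IOException;
          import org.apache.hadoop.io.compress.CompressionCodec;
          import org.apache.hadoop.io.compress.SplittableCompressionCodec;

          /** Illustrative sketch of a split-safety check, not the committed code. */
          final class SplitSafetyCheck {
            private SplitSafetyCheck() {}

            /**
             * Fail fast when a record reader is handed a partial split of a file whose
             * codec cannot be split; start, end and fileLength mirror what a
             * LineRecordReader would see for its split.
             */
            static void verifySplit(CompressionCodec codec, long start, long end,
                long fileLength) throws IOException {
              if (codec != null
                  && !(codec instanceof SplittableCompressionCodec)
                  && (start != 0 || end < fileLength)) {
                throw new IOException("Cannot seek in "
                    + codec.getClass().getSimpleName() + " compressed stream");
              }
            }
          }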

          Niels Basjes added a comment -

          This is the patch I created to throw an exception in the most common case where the described problem occurs.
          So this is a "Fail hard when things already went wrong" patch.

          NOTE 1: This patch includes all the JavaDoc improvements created by Gian Merlino.
          NOTE 2: To create a unit test I needed to include a small gzipped file in the patch. I hope I did that the right way.

          Gian Merlino added a comment -

          I did use the LineRecordReader. I think a safety net like the one you described would have been useful.

          Niels Basjes added a comment -

          Question for Gian Merlino, Garth Goodson and others who have faced this problem.
          Did you use the 'out of the box' LineRecordReader (or a subclass of that) as shown in https://developer.yahoo.com/hadoop/tutorial/module5.html#inputformat or did you write something 'completely new'?

          When I ran into this I followed the tutorial.

          I think the 'next best' spot to stop most problem scenarios (i.e. everyone who implements it the way the tutorial shows) is to let the LineRecordReader fail the entire job when it is initialized with a file using a non-splittable codec and the provided split is not the entire file.

          What do you think?

          Gian Merlino added a comment -

          I just attached a different patch that also adjusts the class-level javadocs in addition to the methods.

          Gian Merlino added a comment -

          We just hit this bug too. I'd really prefer a safer default (like "return false" or something like Niels's patch) but if that is not doable, better docs would have helped. The current docs imply that the default implementation handles splittable vs non-splittable files, even though it doesn't.

          I've attached some wording that I think is more clear.

          Garth Goodson added a comment -

          We just hit this bug. I'm very surprised that it hasn't been fixed. Having a default isSplitable that just returns true is wrong behavior. This ended up causing data corruption for our customers. The default behavior should check whether the codec is splittable, like other input formats do.

          Todd Lipcon added a comment -

          cancelling patch for discussion

          Todd Lipcon added a comment -

          I think this is still incompatible.

          I think the majority of file-based input formats should be splittable - it's the whole reason that MR scales to big files. It's the minority of cases where files shouldn't be splittable.

          IMO the correct fix for this issue is to do a better job of documenting FileInputFormat.isSplitable() and the class-level javadoc on FileInputFormat. This way, those who implement it themselves will take note that they need to override it.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12479785/MAPREDUCE-2094-2011-05-19.patch
          against trunk revision 1124553.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/273//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/273//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/273//console

          This message is automatically generated.

          Niels Basjes added a comment -

          I've not marked this as an "Incompatible change" because the behaviour is only changed in the situations where there was an error condition.

          Niels Basjes added a comment -

          I've created a patch that in my mind fixes this issue the correct way. Whether this really is the case is very much open to discussion.

          There are basically 4 current situations:

          1. People use the existing FileInputFormat derivatives. This patch ensures that all of those that are present in the existing code base (including the examples) still retain the same behavior as before.
          2. People have created a new derivative and HAVE overridden isSplitable with something that fits their needs. This patch does not change those situations.
          3. People have created a new derivative and have NOT overridden isSplitable.
            1. If their input is in a splittable form (like LZO or uncompressed); then this patch will not affect them
            2. In the situation where they have big non-splittable input files (like gzip files), they will have run into unexpected errors. Possibly they will have spent a lot of time looking for performance issues and wrong results in production that did not occur during unit testing (we did!). This patch will fix this problem without any code changes in their code base.
          Niels Basjes added a comment -

          I just noticed that the Yahoo Hadoop tutorial "Module 5: Advanced MapReduce Features" shows a code example for defining your own FileInputFormat.
          The example code shown implements a derivative using FileInputFormat and LineRecordReader without overriding isSplittable; I expect this tutorial code to lead people into this bug.

          Since this bug only becomes apparent when using large "non-splittable" (gzipped) input files, it is also important to notice that almost no one will have a (unit) test that will trip on this bug.
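
          As an aside, here is a minimal sketch of the kind of unit test that would trip on this bug. The NaiveInputFormat class, the temporary file, and the 4 KiB maximum split size are all assumptions invented for the example. The test writes a gzipped file larger than the maximum split size and asserts that getSplits() still yields a single split; this fails against the current "return true" default and passes once splitability takes the compression codec into account.

{code:java}
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.GZIPOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.junit.Assert;
import org.junit.Test;

public class TestNaiveGzipSplitting {

  // Tutorial-style derivative: does NOT override isSplitable.
  public static class NaiveInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                               TaskAttemptContext context) {
      return new LineRecordReader();
    }
  }

  @Test
  public void gzippedInputMustNotBeSplit() throws Exception {
    // Write a gzipped text file that is comfortably larger than the max split size.
    File input = File.createTempFile("naive-split-test", ".txt.gz");
    try (Writer w = new OutputStreamWriter(
        new GZIPOutputStream(new FileOutputStream(input)), StandardCharsets.UTF_8)) {
      for (int i = 0; i < 100000; i++) {
        w.write("line number " + i + "\n");
      }
    }

    Job job = Job.getInstance(new Configuration());
    FileInputFormat.setInputPaths(job, new Path(input.toURI()));
    FileInputFormat.setMaxInputSplitSize(job, 4096); // forces many splits if splitting is allowed

    List<InputSplit> splits = new NaiveInputFormat().getSplits(job);

    // With the unpatched default (isSplitable == true) this yields many splits,
    // each of which would decompress the gzip stream from the start.
    Assert.assertEquals("a gzipped file must map to exactly one split", 1, splits.size());
  }
}
{code}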

          Niels Basjes added a comment -

          Changing the default implementation to return false would be an incompatible change, potentially breaking existing subclasses.

          If by "breaking" you mean "some subclasses will see an unexpected performance degradation", then yes, that will most likely occur (the first one I think of is SequenceFileInputFormat).
          However, I do not expect any functional "breaking" of the output of these subclasses.

          Making the method abstract would also be incompatible and break subclasses, but in a way that they'd easily detect.

          Yes.
          The downside of this option is that if subclasses want detection that depends on the compression codec, I expect a lot of code duplication to occur.
          This code duplication is already occurring within the main code base in KeyValueTextInputFormat, TextInputFormat, CombineFileInputFormat and their old-API counterparts (I found a total of 5 duplicates of the same isSplittable implementation).

          Perhaps the javadoc should just be clarified to better document this default?

          Definitely an option. However, this would not fix the effect in the existing subclasses.

          I just did a quick manual code check of the current trunk and I found that the following classes are derived from FileInputFormat yet do not implement the isSplittable method (and thus use "return true").

          • ./src/java/org/apache/hadoop/mapreduce/lib/input/NLineInputFormat.java
          • ./src/contrib/streaming/src/java/org/apache/hadoop/streaming/AutoInputFormat.java
          • ./src/java/org/apache/hadoop/mapred/SequenceFileInputFormat.java

          I expect that NLineInputFormat and AutoInputFormat will be affected by this "large gzip" bug.
          So I expect that simply fixing the isSplittable documentation would still lead to the need to fix at least these two classes.

          As far as I understand, SequenceFileInputFormat input can only be compressed using "splittable" compression, so the "return true;" from FileInputFormat will work fine there.

          Overall I still prefer the clean option of returning the correct value depending on the compression. That would effectively leave the behavior unchanged in most use cases, yet avoid the problems in those cases where splitting is known to cause them, and thus avoid major issues like the ones we had and described in HADOOP-6901.
          For SequenceFileInputFormat it may be necessary to implement isSplittable as "return true;".

          Effectively, the set of changes I propose (in both the old and new API versions of these classes):
          1) FileInputFormat.isSplittable gets the implementation as seen in TextInputFormat (see the sketch after this list)
          2) The isSplittable implementation is removed from KeyValueTextInputFormat, TextInputFormat and CombineFileInputFormat (useless code duplication)
          3) The isSplittable implementation "return true" is added to SequenceFileInputFormat. Given that you cannot gzip a SequenceFile, I expect this to be an optional fix.
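
          A minimal sketch of what change 1) could look like is shown below. The body mirrors the codec check TextInputFormat already performs; the wrapper class CodecAwareFileInputFormat is a made-up name for illustration only, since the actual proposal is to place this method directly in FileInputFormat.

{code:java}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Illustration only: the proposal is to move this default into FileInputFormat itself.
public abstract class CodecAwareFileInputFormat<K, V> extends FileInputFormat<K, V> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    final CompressionCodec codec =
        new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    if (codec == null) {
      return true; // uncompressed input is always safe to split
    }
    // bzip2 implements SplittableCompressionCodec; gzip does not, so gzip files stay unsplit.
    return codec instanceof SplittableCompressionCodec;
  }
}
{code}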

          Doug Cutting added a comment -

          Changing the default implementation to return false would be an incompatible change, potentially breaking existing subclasses. Making the method abstract would also be incompatible and break subclasses, but in a way that they'd easily detect.

          Perhaps the javadoc should just be clarified to better document this default?
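
          If the javadoc-only route were chosen, a sketch of how the clarified wording might read is below; this wording is my own assumption, not text proposed by anyone on this issue.

{code:java}
/**
 * Is the given filename splitable? The default implementation in
 * FileInputFormat always answers true, even when the file is compressed
 * with a non-splittable codec such as gzip. Subclasses whose input may be
 * stream compressed must override this method and take the compression
 * codec into account; otherwise every split re-reads the entire file.
 *
 * @param context the job context
 * @param filename the file name to check
 * @return true if this file can be split into multiple input splits
 */
protected boolean isSplitable(JobContext context, Path filename) {
  return true;
}
{code}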

          Niels Basjes added a comment -

          It appears that this issue may have the same root cause.


            People

            • Assignee: Niels Basjes
            • Reporter: Niels Basjes
            • Votes: 1
            • Watchers: 13
