Hadoop Map/Reduce
MAPREDUCE-2094

LineRecordReader should not seek into non-splittable, compressed streams.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0
    • Component/s: task
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      When implementing a custom subclass of FileInputFormat, we found that a large gzipped input file was processed several times.

      A file of nearly 1 GiB was processed in its entirety around 36 times, producing garbage results and using far more CPU time than necessary.

      It took a while to figure out, but we found that the default implementation of the isSplitable method in org.apache.hadoop.mapreduce.lib.input.FileInputFormat is simply "return true;".

      This is a very unsafe default and contradicts the JavaDoc of the method, which states: "Is the given filename splitable? Usually, true, but if the file is stream compressed, it will not be." The actual implementation effectively does: "Is the given filename splitable? Always true, even if the file is stream compressed using an unsplittable compression codec."
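
To illustrate why an always-true default is a problem for gzipped input, here is a minimal, dependency-free Java sketch. It is a simplified model and not Hadoop's code: the names GzipSplitDemo, countSplits, gzipLines, readSplitNaively and recordsProcessed are made up for this example, the split count is a plain ceiling division, and the reader is assumed to ignore its split offset and decompress from the start of the stream, since a gzip stream cannot be entered at an arbitrary byte offset.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Simplified model (assumption): every split of a gzip file is read from byte 0.
class GzipSplitDemo {

    /** Number of splits for a file of the given length (plain ceiling division). */
    static long countSplits(long fileLength, long splitSize) {
        return (fileLength + splitSize - 1) / splitSize;
    }

    /** Builds an in-memory gzip "file" containing lineCount text records. */
    static byte[] gzipLines(int lineCount) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            for (int i = 0; i < lineCount; i++) {
                gz.write(("record-" + i + "\n").getBytes(StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads one "split": the offset is ignored and the whole stream is decompressed. */
    static long readSplitNaively(byte[] gzFile) {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(gzFile)),
                StandardCharsets.UTF_8))) {
            return in.lines().count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Total records handled when the file is treated as splittable. */
    static long recordsProcessed(byte[] gzFile, long splitSize) {
        long total = 0;
        long splits = countSplits(gzFile.length, splitSize);
        for (long s = 0; s < splits; s++) {
            total += readSplitNaively(gzFile); // each split re-reads everything
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] gz = gzipLines(1000);
        long splitSize = 256; // arbitrary small value so the demo yields several splits
        System.out.println("records in file:   " + readSplitNaively(gz));
        System.out.println("splits:            " + countSplits(gz.length, splitSize));
        System.out.println("records processed: " + recordsProcessed(gz, splitSize));
    }
}
```

In this model a file cut into N splits has every record handled N times, which is the pattern reported above for the file that was processed around 36 times.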

      In our situation (where the input is always gzipped) we took the easy way out and simply overrode isSplitable in our class with "return false;".

      There are essentially three ways I can think of to fix this (in order of my preference):

      1. Implement a check that looks at the compression used for the file (i.e. migrate the implementation from TextInputFormat to FileInputFormat). This would make the method do what the JavaDoc describes.
      2. "Force" developers to think about it by making this method abstract.
      3. Use a "safe" default (i.e. return false).
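
Option 1 can be sketched without Hadoop on the classpath as follows. This is an illustration under stated assumptions, not the actual patch: Codec and SplittableCodec are hypothetical stand-ins for Hadoop's CompressionCodec and SplittableCompressionCodec, GzipLike and Bzip2Like are placeholder codecs, and the extension table is illustrative only. The decision rule is the point: no codec means splittable, otherwise the file is splittable only if its codec is.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a codec-aware splittability check (not Hadoop's code).
class SplitabilitySketch {

    /** Stand-in for a compression codec. */
    interface Codec {}

    /** Stand-in marker for codecs whose streams can be read from mid-file. */
    interface SplittableCodec extends Codec {}

    static final class GzipLike implements Codec {}
    static final class Bzip2Like implements SplittableCodec {}

    // Illustrative extension-to-codec table (assumption for this example).
    private static final Map<String, Codec> CODECS = new LinkedHashMap<>();
    static {
        CODECS.put(".gz", new GzipLike());
        CODECS.put(".bz2", new Bzip2Like());
    }

    /** Returns the codec registered for the file's extension, or null if none. */
    static Codec codecFor(String fileName) {
        for (Map.Entry<String, Codec> entry : CODECS.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    /** Uncompressed files are splittable; compressed ones only with a splittable codec. */
    static boolean isSplitable(String fileName) {
        Codec codec = codecFor(fileName);
        if (codec == null) {
            return true;
        }
        return codec instanceof SplittableCodec;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"part-00000.txt", "access.log.gz", "access.log.bz2"}) {
            System.out.println(name + " splittable: " + isSplitable(name));
        }
    }
}
```

Compared with option 3 (always return false), a check of this kind keeps parallelism for uncompressed and splittable-compressed input while still protecting stream-compressed files.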

      Attachments

      1. M2094.patch (10 kB, Chris Douglas)
      2. M2094-1.patch (11 kB, Chris Douglas)
      3. MAPREDUCE-2094-2011-05-19.patch (6 kB, Niels Basjes)
      4. MAPREDUCE-2094-20140727.patch (11 kB, Niels Basjes)
      5. MAPREDUCE-2094-20140727-svn.patch (10 kB, Niels Basjes)
      6. MAPREDUCE-2094-20140727-svn-fixed-spaces.patch (11 kB, Niels Basjes)
      7. MAPREDUCE-2094-2015-05-05-2328.patch (11 kB, Niels Basjes)
      8. MAPREDUCE-2094-FileInputFormat-docs-v2.patch (4 kB, Gian Merlino)

          Activity

          Niels Basjes created issue -
          Niels Basjes made changes -
          Field: Original Value -> New Value
          Link: This issue relates to HADOOP-6901 [ HADOOP-6901 ]
          Niels Basjes made changes -
          Issue Type: Improvement [ 4 ] -> Bug [ 1 ]
          Niels Basjes made changes -
          Description (Original Value):
          When implementing a custom derivative of FileInputFormat we ran into the effect that a large Gzipped input file would be processed several times. A near 1GiB file would be processed around 36 times in its entirety. Thus producing garbage results and taking up a lot more CPU time than needed.

          It took a while to figure out and what we found is that the default implementation of the isSplitable method in [org.apache.hadoop.mapreduce.lib.input.FileInputFormat | http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java?view=markup ] is simply "return true;". This is a very unsafe default and is in contradiction with the JavaDoc of the method which states: "Is the given filename splitable? Usually, true, but if the file is stream compressed, it will not be. "

          For our situation (where we always have Gzipped input) we took the easy way out and simply implemented an isSplitable in our class that does "return false; "

          Now there are essentially 3 ways I can think of for fixing this (in order of what I would find preferable):
          # Implement something that looks at the used compression of the file (i.e. do migrate the implementation from TextInputFormat to FileInputFormat). This would make the method do what the JavaDoc describes.
          # "Force" developers to think about it and make this method (and therefore the entire FileInputFormat class) abstract.
          # Use a "safe" default (i.e. return false)
          Description (New Value): the text shown under Description above.
          Niels Basjes made changes -
          Attachment: MAPREDUCE-2094-2011-05-19.patch [ 12479785 ]
          Niels Basjes made changes -
          Status: Open [ 1 ] -> Patch Available [ 10002 ]
          Release Note: Fixed splitting errors present in many FileInputFormat derivatives that do not override isSplitable.
          Affects Version/s: 0.21.0 [ 12314045 ]
          Affects Version/s: 0.20.1 [ 12314047 ]
          Affects Version/s: 0.20.2 [ 12314205 ]
          Assignee: Niels Basjes [ nielsbasjes ]
          Todd Lipcon made changes -
          Status: Patch Available [ 10002 ] -> Open [ 1 ]
          Gian Merlino made changes -
          Attachment: MAPREDUCE-2094-FileInputFormat-docs.patch [ 12654694 ]
          Gian Merlino made changes -
          Attachment: MAPREDUCE-2094-FileInputFormat-docs.patch [ 12654694 ]
          Gian Merlino made changes -
          Niels Basjes made changes -
          Attachment: MAPREDUCE-2094-20140727.patch [ 12658034 ]
          Niels Basjes made changes -
          Status: Open [ 1 ] -> Patch Available [ 10002 ]
          Release Note: Fixed splitting errors present in many FileInputFormat derivatives that do not override isSplitable. -> Throw an Exception in the most common error scenario present in many FileInputFormat derivatives that do not override isSplitable.
          Niels Basjes made changes -
          Status: Patch Available [ 10002 ] -> Open [ 1 ]
          Niels Basjes made changes -
          Attachment: MAPREDUCE-2094-20140727-svn.patch [ 12658039 ]
          Niels Basjes made changes -
          Status: Open [ 1 ] -> Patch Available [ 10002 ]
          Niels Basjes made changes -
          Niels Basjes made changes -
          Status: Patch Available [ 10002 ] -> Open [ 1 ]
          Niels Basjes made changes -
          Status: Open [ 1 ] -> Patch Available [ 10002 ]
          Niels Basjes made changes -
          Status: Patch Available [ 10002 ] -> Open [ 1 ]
          Niels Basjes made changes -
          Attachment: MAPREDUCE-2094-2015-05-05-2328.patch [ 12730617 ]
          Niels Basjes made changes -
          Status: Open [ 1 ] -> Patch Available [ 10002 ]
          Allen Wittenauer made changes -
          Labels: BB2015-05-TBR
          Ray Chiang made changes -
          Labels: BB2015-05-TBR
          Chris Douglas made changes -
          Attachment: M2094.patch [ 12731592 ]
          Chris Douglas made changes -
          Summary: org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour. -> LineRecordReader should not seek into non-splittable, compressed streams.
          Chris Douglas made changes -
          Attachment: M2094-1.patch [ 12731611 ]
          Chris Douglas made changes -
          Status: Patch Available [ 10002 ] -> Resolved [ 5 ]
          Hadoop Flags: Reviewed [ 10343 ]
          Release Note: Throw an Exception in the most common error scenario present in many FileInputFormat derivatives that do not override isSplitable.
          Fix Version/s: 2.8.0 [ 12329060 ]
          Resolution: Fixed [ 1 ]
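
The final summary and release note suggest a guard in the record reader rather than a change to the isSplitable default. One plausible reading, sketched below in dependency-free Java, is to reject any split that starts at a non-zero offset when the file's codec is not splittable. Everything here is an assumption for illustration: SeekGuardSketch, Codec, SplittableCodec, GzipLike, Bzip2Like, checkSplitStart and isReadable are hypothetical names, and the exception type and message are not taken from the committed patch.

```java
import java.io.IOException;

// Hypothetical sketch of a "do not seek into a non-splittable stream" guard.
class SeekGuardSketch {

    /** Stand-in for a compression codec. */
    interface Codec {}

    /** Stand-in marker for codecs that support reading from mid-stream. */
    interface SplittableCodec extends Codec {}

    static final class GzipLike implements Codec {}
    static final class Bzip2Like implements SplittableCodec {}

    /**
     * Validates a split before any data is read and returns its start offset.
     * A null codec means the file is uncompressed, so any offset is acceptable.
     */
    static long checkSplitStart(long start, Codec codec) throws IOException {
        boolean nonSplittable = codec != null && !(codec instanceof SplittableCodec);
        if (nonSplittable && start != 0) {
            // Fail fast instead of silently re-reading the file from byte 0.
            throw new IOException("Cannot seek in "
                    + codec.getClass().getSimpleName() + " compressed stream");
        }
        return start;
    }

    /** Convenience wrapper that reports the check result as a boolean. */
    static boolean isReadable(long start, Codec codec) {
        try {
            checkSplitStart(start, codec);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("gzip-like, start 0:    " + isReadable(0, new GzipLike()));
        System.out.println("gzip-like, start 4096: " + isReadable(4096, new GzipLike()));
        System.out.println("bzip2-like, start 4096: " + isReadable(4096, new Bzip2Like()));
        System.out.println("uncompressed, start 4096: " + isReadable(4096, null));
    }
}
```

A guard of this kind turns the silent duplicate processing into an immediate, visible failure for input formats that forget to override isSplitable.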

            People

            • Assignee: Niels Basjes
            • Reporter: Niels Basjes
            • Votes: 1
            • Watchers: 13
