Hadoop Map/Reduce
MAPREDUCE-1795

add error option if file-based record-readers fail to consume all input (e.g., concatenated gzip, bzip2)

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      When running MapReduce with concatenated gzip files as input, only the first part ("member" in gzip spec parlance, http://www.ietf.org/rfc/rfc1952.txt) is read; the remainder is silently ignored. As a first step toward fixing that, this issue will add a configurable option to throw an error in such cases.

      MAPREDUCE-469 is the tracker for the more complete fix/feature, whenever that occurs.
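
      As a rough illustration only (none of this is from the issue or an attached patch), the proposed option could gate a check along the following lines. The property name and helper class are hypothetical, and the sketch deliberately glosses over the hard part discussed in the comments below, namely obtaining a reliable "bytes actually consumed" figure for compressed input.

      import java.io.IOException;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class UnconsumedInputCheck {

        // Hypothetical property name; no such option was ever committed (the issue
        // was resolved Won't Fix in favor of fixing the codecs under MAPREDUCE-469).
        public static final String ERROR_ON_UNCONSUMED_INPUT =
            "mapreduce.input.fileinputformat.error.on.unconsumed.input";

        /**
         * Intended to be called by a file-based record reader after its last record:
         * if decompression stopped short of the physical end of the file (e.g., only
         * the first gzip member was consumed), optionally fail loudly instead of
         * silently dropping the rest of the input.
         */
        public static void checkFullyConsumed(Configuration conf, Path file,
            long bytesConsumed) throws IOException {
          if (!conf.getBoolean(ERROR_ON_UNCONSUMED_INPUT, false)) {
            return;                        // default: keep the current silent behavior
          }
          FileSystem fs = file.getFileSystem(conf);
          long fileLength = fs.getFileStatus(file).getLen();
          if (bytesConsumed < fileLength) {
            throw new IOException("Record reader consumed only " + bytesConsumed
                + " of " + fileLength + " bytes of " + file
                + "; input may be a concatenated gzip/bzip2 file");
          }
        }
      }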

        Issue Links

          Activity

          Greg Roelofs created issue -
          Greg Roelofs made changes -
          Field Original Value New Value
          Link This issue is a clone of MAPREDUCE-469 [ MAPREDUCE-469 ]
          Greg Roelofs made changes -
          Link This issue is related to PIG-42 [ PIG-42 ]
          Greg Roelofs made changes -
          Link This issue is related to HADOOP-6335 [ HADOOP-6335 ]
          Greg Roelofs made changes -
          Original Estimate 336h [ 1209600 ]
          Remaining Estimate 336h [ 1209600 ]
          Affects Version/s 0.20.2 [ 12314205 ]
          Description
            Original Value: When running MapReduce with concatenated gzip files as input only the first part is read, which is confusing, to say the least. Concatenated gzip is described in http://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage and in http://www.ietf.org/rfc/rfc1952.txt. (See original report at http://www.nabble.com/Problem-with-Hadoop-and-concatenated-gzip-files-to21383097.html)
            New Value: When running MapReduce with concatenated gzip files as input, only the first part ("member" in gzip spec parlance, http://www.ietf.org/rfc/rfc1952.txt) is read; the remainder is silently ignored. As a first step toward fixing that, this issue will add a configurable option to throw an error in such cases.

            MAPREDUCE-469 is the tracker for the more complete fix/feature, whenever that occurs.
          Greg Roelofs made changes -
          Link This issue is related to MAPREDUCE-469 [ MAPREDUCE-469 ]
          Greg Roelofs made changes -
          Link This issue is a clone of MAPREDUCE-469 [ MAPREDUCE-469 ]
          Greg Roelofs made changes -
          Original Estimate 336h [ 1209600 ]
          Remaining Estimate 336h [ 1209600 ]
          Assignee Ravi Gummadi [ ravidotg ]
          Affects Version/s 0.20.2 [ 12314205 ]
          Greg Roelofs made changes -
          Assignee Greg Roelofs [ roelofs ]
          Greg Roelofs added a comment -

          It appears that the initial target location for the fix, in LineRecordReader's next() method (0.20.x) or nextKeyValue() (trunk), isn't actually workable due to buffering. Ideally one would be able to check getFilePosition() after hitting the end of the first member/zlib-stream, notice that it's not equal to the end of file, and optionally throw an error. However, the file position, in general, is beyond the end of the zlib-stream, and for small concatenated inputs it may actually be at the end of file even though the logical offset isn't. There doesn't appear to be a way to get at the logical "stream offset" at this level, though if anyone is aware of a way, please let me know.

          In the meantime, we're planning to simply fix the bug (i.e., MAPREDUCE-469), at least for the native-zlib codec. A workaround for the Java-zlib alternative is in the 30-AUG-2006 comment on Sun's bug 4691425 (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4691425), but without any explicit license that would allow us to redistribute it as part of Hadoop. And bzip2 reportedly is already fixed on the trunk (HADOOP-4012).

          Barring any new information, I plan to resolve this issue as invalid.
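
          For illustration (standalone code, not from this issue or any attached patch), the single-member behavior referenced above can be reproduced outside Hadoop: on a JDK still affected by Sun bug 4691425, the program below prints only "first member", while a fixed JDK decompresses and prints both lines.

          import java.io.ByteArrayInputStream;
          import java.io.ByteArrayOutputStream;
          import java.util.zip.GZIPInputStream;
          import java.util.zip.GZIPOutputStream;

          public class ConcatenatedGzipDemo {
            private static byte[] gzip(String s) throws Exception {
              ByteArrayOutputStream bos = new ByteArrayOutputStream();
              GZIPOutputStream gz = new GZIPOutputStream(bos);
              gz.write(s.getBytes("UTF-8"));
              gz.close();
              return bos.toByteArray();
            }

            public static void main(String[] args) throws Exception {
              // Same layout as `cat a.gz b.gz > c.gz`: two complete gzip members back to back.
              ByteArrayOutputStream cat = new ByteArrayOutputStream();
              cat.write(gzip("first member\n"));
              cat.write(gzip("second member\n"));

              GZIPInputStream in =
                  new GZIPInputStream(new ByteArrayInputStream(cat.toByteArray()));
              ByteArrayOutputStream out = new ByteArrayOutputStream();
              byte[] buf = new byte[4096];
              int n;
              while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
              }
              in.close();

              // Affected JDKs stop silently after the first member; fixed JDKs print both lines.
              System.out.print(out.toString("UTF-8"));
            }
          }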

          Greg Roelofs added a comment -

          Per previous comment, we're going to fix the underlying issue instead (i.e., make decompressors support concatenated streams). See MAPREDUCE-469.
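
          As a minimal standalone sketch of what "support concatenated streams" amounts to at the decompressor level (using raw zlib members for brevity; real gzip members additionally need per-member header and trailer handling, which is what the codec work under MAPREDUCE-469 covers): when one member finishes and undigested input remains, reset and keep decompressing instead of stopping.

          import java.io.ByteArrayOutputStream;
          import java.util.zip.Deflater;
          import java.util.zip.DeflaterOutputStream;
          import java.util.zip.Inflater;

          public class ConcatenatedZlibDemo {
            public static void main(String[] args) throws Exception {
              // Build two independent zlib streams back to back.
              ByteArrayOutputStream cat = new ByteArrayOutputStream();
              for (String s : new String[] { "alpha\n", "beta\n" }) {
                DeflaterOutputStream dos = new DeflaterOutputStream(cat, new Deflater());
                dos.write(s.getBytes("UTF-8"));
                dos.finish();
              }

              Inflater inf = new Inflater();
              byte[] cur = cat.toByteArray();
              inf.setInput(cur);
              ByteArrayOutputStream out = new ByteArrayOutputStream();
              byte[] buf = new byte[4096];
              while (true) {
                int n = inf.inflate(buf);
                if (n > 0) out.write(buf, 0, n);
                if (inf.finished()) {
                  // End of one member: if unconsumed input remains, start the next member
                  // instead of silently stopping; this continuation is what the record
                  // readers were missing for concatenated compressed files.
                  int remaining = inf.getRemaining();
                  if (remaining == 0) break;
                  byte[] rest = new byte[remaining];
                  System.arraycopy(cur, cur.length - remaining, rest, 0, remaining);
                  inf.reset();
                  inf.setInput(rest);
                  cur = rest;
                } else if (n == 0 && inf.needsInput()) {
                  break;  // truncated input
                }
              }
              System.out.print(out.toString("UTF-8"));  // prints "alpha" then "beta"
            }
          }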

          Greg Roelofs made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Won't Fix [ 2 ]
          Transition: Open → Resolved
          Time In Source Status: 23d 20h 38m
          Execution Times: 1
          Last Executer: Greg Roelofs
          Last Execution Date: 10/Jun/10 22:57

            People

            • Assignee:
              Greg Roelofs
              Reporter:
              Greg Roelofs
            • Votes:
              0
              Watchers:
              3

              Dates

              • Created:
                Updated:
                Resolved:

                Development