Hadoop Map/Reduce
MAPREDUCE-1795

add error option if file-based record-readers fail to consume all input (e.g., concatenated gzip, bzip2)

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      When running MapReduce with concatenated gzip files as input, only the first part ("member" in gzip spec parlance, http://www.ietf.org/rfc/rfc1952.txt) is read; the remainder is silently ignored. As a first step toward fixing that, this issue will add a configurable option to throw an error in such cases.

      MAPREDUCE-469 is the tracker for the more complete fix/feature, whenever that occurs.
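      The symptom is easy to reproduce outside Hadoop with plain java.util.zip (a minimal, hypothetical sketch; the class name and sample strings are made up, the APIs are standard). On a JDK affected by Sun bug 4691425, GZIPInputStream decodes only the first member below; a concatenation-aware decoder emits both lines.

      import java.io.ByteArrayInputStream;
      import java.io.ByteArrayOutputStream;
      import java.io.IOException;
      import java.util.zip.GZIPInputStream;
      import java.util.zip.GZIPOutputStream;

      public class ConcatGzipDemo {
          // Compress a string into one complete gzip member.
          static byte[] gzip(String s) throws IOException {
              ByteArrayOutputStream bos = new ByteArrayOutputStream();
              GZIPOutputStream gz = new GZIPOutputStream(bos);
              gz.write(s.getBytes("UTF-8"));
              gz.close();
              return bos.toByteArray();
          }

          public static void main(String[] args) throws IOException {
              // Two independent members, byte-concatenated like `cat a.gz b.gz`.
              ByteArrayOutputStream concat = new ByteArrayOutputStream();
              concat.write(gzip("first member\n"));
              concat.write(gzip("second member\n"));

              GZIPInputStream in = new GZIPInputStream(
                  new ByteArrayInputStream(concat.toByteArray()));
              ByteArrayOutputStream out = new ByteArrayOutputStream();
              byte[] buf = new byte[4096];
              for (int n; (n = in.read(buf)) != -1; ) {
                  out.write(buf, 0, n);
              }

              // With the 4691425 bug, this prints only "first member";
              // with a concatenation-aware decoder it prints both lines.
              System.out.print(out.toString("UTF-8"));
          }
      }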

          Activity

          Greg Roelofs added a comment -

          It appears that the initial target location for the fix, in LineRecordReader's next() method (0.20.x) or nextKeyValue() (trunk), isn't actually workable due to buffering. Ideally one would be able to check getFilePosition() after hitting the end of the first member/zlib-stream, notice that it's not equal to the end of file, and optionally throw an error. However, the file position, in general, is beyond the end of the zlib-stream, and for small concatenated inputs it may actually be at the end of file even though the logical offset isn't. There doesn't appear to be a way to get at the logical "stream offset" at this level, though if anyone is aware of a way, please let me know.
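          For illustration, the buffering effect described above is reproducible with a byte counter standing in for getFilePosition() (a hypothetical sketch, not Hadoop code; the class names are made up): on a small concatenated input, the decompressor's read-ahead drains the underlying stream as soon as the first member is decoded, so a position-versus-length check learns nothing.

          import java.io.ByteArrayInputStream;
          import java.io.ByteArrayOutputStream;
          import java.io.FilterInputStream;
          import java.io.IOException;
          import java.io.InputStream;
          import java.util.zip.GZIPInputStream;
          import java.util.zip.GZIPOutputStream;

          public class ReadAheadDemo {
              // Counts bytes consumed from the underlying stream; a stand-in
              // for the file position a record reader could observe.
              static class CountingInputStream extends FilterInputStream {
                  long pos;
                  CountingInputStream(InputStream in) { super(in); }
                  @Override public int read() throws IOException {
                      int b = in.read(); if (b != -1) pos++; return b;
                  }
                  @Override public int read(byte[] b, int off, int len) throws IOException {
                      int n = in.read(b, off, len); if (n > 0) pos += n; return n;
                  }
              }

              static byte[] gzip(String s) throws IOException {
                  ByteArrayOutputStream bos = new ByteArrayOutputStream();
                  GZIPOutputStream gz = new GZIPOutputStream(bos);
                  gz.write(s.getBytes("UTF-8"));
                  gz.close();
                  return bos.toByteArray();
              }

              public static void main(String[] args) throws IOException {
                  ByteArrayOutputStream concat = new ByteArrayOutputStream();
                  concat.write(gzip("first member\n"));
                  concat.write(gzip("second member\n"));
                  byte[] file = concat.toByteArray();

                  CountingInputStream raw =
                      new CountingInputStream(new ByteArrayInputStream(file));
                  GZIPInputStream gz = new GZIPInputStream(raw);
                  while (gz.read() != -1) { }   // decode to logical end-of-stream

                  // Even on a JDK that decodes only the first member, raw.pos is
                  // typically already file.length here: the codec's internal
                  // read-ahead buffer (512 bytes by default) has swallowed the
                  // second member's bytes, so comparing the file position to the
                  // file length cannot flag the unconsumed input.
                  System.out.println("underlying bytes consumed: "
                      + raw.pos + " / " + file.length);
              }
          }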

          In the meantime, we're planning to simply fix the bug (i.e., MAPREDUCE-469), at least for the native-zlib codec. A workaround for the Java-zlib alternative is in the 30-AUG-2006 comment on Sun's bug 4691425 (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4691425), but without any explicit license that would allow us to redistribute it as part of Hadoop. And bzip2 reportedly is already fixed on the trunk (HADOOP-4012).

          Barring any new information, I plan to resolve this issue as invalid.

          Greg Roelofs added a comment -

          Per previous comment, we're going to fix the underlying issue instead (i.e., make decompressors support concatenated streams). See MAPREDUCE-469.
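          For the flavor of that fix, here is a minimal sketch at the java.util.zip.Inflater level over raw zlib streams (an illustration under assumed names, not the MAPREDUCE-469 patch, which works at the codec layer): when one stream ends with input still unconsumed, reset the inflater and keep decoding instead of silently stopping.

          import java.io.ByteArrayOutputStream;
          import java.util.Arrays;
          import java.util.zip.DataFormatException;
          import java.util.zip.DeflaterOutputStream;
          import java.util.zip.Inflater;

          public class ConcatZlibInflate {
              // Decompress a buffer holding one or more back-to-back zlib streams.
              public static byte[] inflateAll(byte[] input) throws DataFormatException {
                  Inflater inf = new Inflater();        // zlib framing (RFC 1950)
                  inf.setInput(input);
                  ByteArrayOutputStream out = new ByteArrayOutputStream();
                  byte[] buf = new byte[4096];
                  while (!inf.finished() || inf.getRemaining() > 0) {
                      if (inf.finished()) {
                          // One stream ended but bytes remain: restart on the
                          // leftovers instead of silently dropping them.
                          input = Arrays.copyOfRange(
                              input, input.length - inf.getRemaining(), input.length);
                          inf.reset();
                          inf.setInput(input);
                      }
                      int n = inf.inflate(buf);
                      if (n == 0 && inf.needsInput()) break;   // truncated input
                      out.write(buf, 0, n);
                  }
                  inf.end();
                  return out.toByteArray();
              }

              public static void main(String[] args) throws Exception {
                  ByteArrayOutputStream two = new ByteArrayOutputStream();
                  for (String s : new String[] { "first stream\n", "second stream\n" }) {
                      DeflaterOutputStream d = new DeflaterOutputStream(two);
                      d.write(s.getBytes("UTF-8"));
                      d.finish();           // end this zlib stream, keep `two` open
                  }
                  // Prints both lines; a reader that stops at the first
                  // end-of-stream would print only the first.
                  System.out.print(new String(inflateAll(two.toByteArray()), "UTF-8"));
              }
          }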


    People

    • Assignee: Greg Roelofs
    • Reporter: Greg Roelofs
    • Votes: 0
    • Watchers: 3

    Dates

    • Created:
    • Updated:
    • Resolved:
