Hadoop Common / HADOOP-7823

port HADOOP-4012 to branch-1 (splitting support for bzip2)

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.205.0
    • Fix Version/s: 1.1.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      Please see HADOOP-4012 - Providing splitting support for bzip2 compressed files.

      1. HADOOP-7823-branch-1-v4.patch
        83 kB
        Andrew Purtell
      2. HADOOP-7823-branch-1-v3.patch
        89 kB
        Andrew Purtell
      3. HADOOP-7823-branch-1-v3.patch
        89 kB
        Andrew Purtell
      4. HADOOP-7823-branch-1-v2.patch
        86 kB
        Andrew Purtell
      5. HADOOP-7823-branch-1.patch
        84 kB
        Andrew Purtell

          Activity

          Andrew Purtell added a comment -

          Attached is a port of the most recent patch on 4012. TestCodec passes. Running full tests now.

          Andrew Purtell added a comment -

          All tests pass except for TestUlimit#testCommandLine in the streaming contrib, which is an unrelated local failure due to a forked 'ulimit -v' not returning the expected output.

          Andrew Purtell added a comment -

          This issue only requests a back port of 4012 to branch-1 but upon examination of the input formats in branch-1 it seems input format support for splittable codecs is missing. Is that work scoped in another JIRA? I'm currently back porting that from 0.23. Will put that up as another patch here.

          Chris Douglas added a comment -

          > This issue only requests a back port of 4012 to branch-1 but upon examination of the input formats in branch-1 it seems input format support for splittable codecs is missing. Is that work scoped in another JIRA? I'm currently back porting that from 0.23. Will put that up as another patch here.

          Did you include MAPREDUCE-830 and MAPREDUCE-772?

          Andrew Purtell added a comment -

          > Did you include MAPREDUCE-830 and MAPREDUCE-772?

          Yes to 772, yes to 830, but only in o.a.h.mapreduce. Will upload another patch with changes in o.a.h.mapred as well.

          Andrew Purtell added a comment -

          Attached is a patch that I think pulls it all together. TestCodec and TestTextInputFormat pass. Running full tests now.

          Andrew Purtell added a comment -

          v2 patch. Three changes to the first patch:

          1. Fixes a bug I introduced in o.a.h.mapred.LineRecordReader where the key would not be properly set if a line is too long and is skipped.

          2. o.a.h.mapred.LineRecordReader#getProgress should use getFilePosition, and therefore callers need to handle or declare they may throw IOE. This change was not in the patch for 772 or 830 but is present in 0.23.

          3. o.a.h.mapreduce.LineRecordReader#getProgress should use getFilePosition, also now throws IOE.

          All tests pass locally except for the previously reported unrelated failure.
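
Editorial sketch of changes 2 and 3 above. With compressed input, the count of decompressed bytes consumed does not correspond to the split's byte range in the file, so progress is better derived from the position of the underlying file stream. Querying that position can fail, which is why the method now declares IOException. The following is a minimal, self-contained illustration under those assumptions. `PositionSource` and `ProgressSketch` are hypothetical names invented for this example, and the code is not taken from the patch or from LineRecordReader.

```java
import java.io.IOException;

/** Hypothetical stand-in for a stream that can report its byte position in the file. */
interface PositionSource {
    long getPos() throws IOException;
}

/** Illustrative sketch only; not the LineRecordReader code from the patch. */
final class ProgressSketch {
    private final long start;                        // first byte of the split
    private final long end;                          // byte just past the split
    private final PositionSource compressedPosition; // null for uncompressed input
    private long uncompressedPos;                    // bytes consumed, uncompressed case

    ProgressSketch(long start, long end, PositionSource compressedPosition) {
        this.start = start;
        this.end = end;
        this.compressedPosition = compressedPosition;
        this.uncompressedPos = start;
    }

    /** Record that the reader consumed some bytes of uncompressed input. */
    void advance(long bytes) {
        uncompressedPos += bytes;
    }

    /** For compressed input, ask the underlying stream; otherwise use our own counter. */
    long getFilePosition() throws IOException {
        return compressedPosition != null ? compressedPosition.getPos() : uncompressedPos;
    }

    /** Fraction of the split consumed, clamped to 1.0; 0.0 for an empty split. */
    float getProgress() throws IOException {
        if (start == end) {
            return 0.0f;
        }
        return Math.min(1.0f, (getFilePosition() - start) / (float) (end - start));
    }
}
```

Because `getProgress` can now throw, any caller has to catch IOException or declare it, which matches the note above about callers needing to handle or declare it.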

          Andrew Purtell added a comment -

          A private Hudson instance found test failures that I missed in a local test run, due to changes missing from NLineInputFormat and TestMultipleCacheFiles.

          Apologies for the noise; this touched more places than expected.

          Tim Broberg added a comment -

          I applied this patch to 1.0.0 and developed against it for a few days. The patch applied fine, everything compiled, and I was able to make use of the classes until I ran into HADOOP-8003.

          I did not exercise splittable Bzip itself, but for as much as I looked at, I'm +1 on patch v3.

          Weili Shao added a comment -

          I applied patch v3 to Hadoop 1.0.0; apart from one test file, the remaining files patched cleanly. However, I couldn't find any split implementation for the case where a file is compressed into bzip2 format, so I am looking for more information here. Does anyone know whether a bzip2 file can be split with the patch, or do we have to implement our own split method to actually perform splitting? Thanks!

          Harsh J added a comment -

          MAPREDUCE-830 is required for MR-side splitting of Bzip2 data.
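
Editorial sketch for readers with the same question. A bzip2 file is a sequence of independently compressed blocks, and each block begins with the 48-bit magic 0x314159265359 at an arbitrary bit offset (blocks are not byte-aligned). A reader that starts in the middle of a file can therefore scan forward bit by bit until it sees that magic and begin decompressing there. The code below only illustrates that scan on an in-memory byte array. `Bzip2BlockScan`, `nextBlockBit`, and `withMagicAt` are hypothetical names made up for this example; this is not the code from the patch, and it ignores real-world concerns such as the magic value occurring by chance inside compressed data.

```java
/** Illustrative sketch only; not Hadoop's bzip2 stream implementation. */
final class Bzip2BlockScan {
    /** 48-bit marker that begins every bzip2 compressed block. */
    static final long BLOCK_MAGIC = 0x314159265359L;
    private static final long MASK_48 = 0xFFFFFFFFFFFFL;

    /**
     * Returns the bit offset of the first block marker that starts at or
     * after fromBit, or -1 if none is found. Bits are read most significant first.
     */
    static long nextBlockBit(byte[] data, long fromBit) {
        long window = 0;
        long totalBits = (long) data.length * 8;
        for (long bit = fromBit; bit < totalBits; bit++) {
            int b = (data[(int) (bit >>> 3)] >> (7 - (int) (bit & 7))) & 1;
            window = ((window << 1) | b) & MASK_48;
            // Only compare once a full 48 bits have been shifted in.
            if (bit - fromBit + 1 >= 48 && window == BLOCK_MAGIC) {
                return bit - 47;
            }
        }
        return -1;
    }

    /** Test helper: a zero-filled buffer with the marker written at bitOffset. */
    static byte[] withMagicAt(int lengthBytes, long bitOffset) {
        byte[] out = new byte[lengthBytes];
        for (int i = 0; i < 48; i++) {
            long pos = bitOffset + i;
            if (((BLOCK_MAGIC >>> (47 - i)) & 1) == 1) {
                out[(int) (pos >>> 3)] |= (byte) (1 << (7 - (int) (pos & 7)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = withMagicAt(16, 35);            // marker at a non-byte-aligned offset
        System.out.println(nextBlockBit(data, 0));    // scan from the start of the buffer
        System.out.println(nextBlockBit(data, 36));   // scan from just past the marker
    }
}
```

In a MapReduce setting the input format decides where the splits fall, and the record reader for each split then has to resynchronize on a block boundary in roughly this way, which is consistent with the comment above that the MR-side changes are needed in addition to the codec.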

          Matt Foley added a comment -

          This patch seems to be ready. Can a committer familiar with the compression/decompression code please review this patch? Thanks.

          Chris Douglas added a comment -

          The patch should also include HADOOP-6925

          The rest of the code looks familiar (where did the NLineInputFormat change come from?). IIRC the unit test coverage is pretty good, but how else has this been verified?

          Andrew Purtell added a comment -

          > The rest of the code looks familiar (where did the NLineInputFormat change come from?).

          I also did manual code inspection of 0.23, as well as followed JIRA tickets referenced by commenters on this issue.

          Will put up a v4 shortly that includes HADOOP-6925.

          Andrew Purtell added a comment -

          v4 patch includes HADOOP-6925 fix to BZip2Codec. Modified TestCodec passes locally.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12533144/HADOOP-7823-branch-1-v4.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified test files.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/1136//console

          This message is automatically generated.

          Tim Broberg added a comment -

          +1 from me.

          Got clean branch-1.1, applied patch with no warnings or errors. Ran 10GB terasort with bzip-compressed input to test splitting. Got 10GB out which passed teravalidate.

          Chris Douglas added a comment -

          +1

          Committed to branch-1

          Thanks Tim for verifying the backport

          Chris Douglas added a comment -

          Also committed to branch-1.1

          Matt Foley added a comment -

          Closed upon release of Hadoop-1.1.0.


            People

            • Assignee: Andrew Purtell
            • Reporter: Tim Broberg
            • Votes: 2
            • Watchers: 13
