Hadoop Common
  1. Hadoop Common
  2. HADOOP-5326

bzip2 codec (CBZip2OutputStream) creates corrupted output file for some inputs

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.19.0, 0.19.1, 0.19.2, 0.20.0, 0.21.0
    • Fix Version/s: 0.19.2, 0.20.0
    • Component/s: io
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Bzip2 codec generated corrupted output files in some test executions I performed. This bug is probably related to https://issues.apache.org/bugzilla/show_bug.cgi?id=41596.

      • In my case, the problem seems to be at the BWT (Burrows-Wheeler Transform) implementation.
      1. HADOOP-5326.patch
        0.7 kB
        Rodrigo Schmidt
      2. HADOOP-5326.2.patch
        0.7 kB
        Rodrigo Schmidt

        Issue Links

          Activity

          Hide
          Rodrigo Schmidt added a comment - - edited

          I found the problem, corrected it, and tested with my data and the data available in https://issues.apache.org/bugzilla/show_bug.cgi?id=41596

          Show
          Rodrigo Schmidt added a comment - - edited I found the problem, corrected it, and tested with my data and the data available in https://issues.apache.org/bugzilla/show_bug.cgi?id=41596
          Hide
          Zheng Shao added a comment -

          nitpick:

          The original code does only one "+" operation, while the new code requires many (depending on how compiler optimizes it).

          What about a code block like this?

          int i = lastShadow + NUM_OVERSHOOT_BYTES;
          do {
            quadrant[i] = 0;
          } while (--i > 0);
          
          Show
          Zheng Shao added a comment - nitpick: The original code does only one "+" operation, while the new code requires many (depending on how compiler optimizes it). What about a code block like this? int i = lastShadow + NUM_OVERSHOOT_BYTES; do { quadrant[i] = 0; } while (--i > 0);
          Hide
          Rodrigo Schmidt added a comment -

          I doubt the compiler doesn't optimize that. Besides, this is just an initialization loop, executed once per block.

          I can rewrite it so that it looks closer to the original. I don't like this style, though.

              for (int i = lastShadow + NUM_OVERSHOOT_BYTES +1; --i >= 0;) {
                 quadrant[i] = 0;
               }
          
          Show
          Rodrigo Schmidt added a comment - I doubt the compiler doesn't optimize that. Besides, this is just an initialization loop, executed once per block. I can rewrite it so that it looks closer to the original. I don't like this style, though. for (int i = lastShadow + NUM_OVERSHOOT_BYTES +1; --i >= 0;) { quadrant[i] = 0; }
          Hide
          Rodrigo Schmidt added a comment -

          New patch contains modified and tested code.

          Show
          Rodrigo Schmidt added a comment - New patch contains modified and tested code.
          Hide
          Zheng Shao added a comment -

          +1

          Show
          Zheng Shao added a comment - +1
          Hide
          Zheng Shao added a comment -

          Thanks Rodrigo!

          Show
          Zheng Shao added a comment - Thanks Rodrigo!
          Hide
          Nigel Daley added a comment -

          Rodrigo, why no unit test?

          Zheng, why didn't you wait for +1 from Hudson?

          Show
          Nigel Daley added a comment - Rodrigo, why no unit test? Zheng, why didn't you wait for +1 from Hudson?
          Hide
          Rodrigo Schmidt added a comment -

          Hi Nigel,

          Thanks for the feedback.

          I assumed the bzip2 codec already had meaningful unit tests. As you can see in the patch, it corrects a very simple bug and I didn't think that adding a specific test for it would be any helpful/useful (unless someone deletes the +1 I added to the loop initialization).

          Besides, the data that triggered the bug and I was using for testing cannot be disclosed.

          Cheers,
          Rodrigo

          Show
          Rodrigo Schmidt added a comment - Hi Nigel, Thanks for the feedback. I assumed the bzip2 codec already had meaningful unit tests. As you can see in the patch, it corrects a very simple bug and I didn't think that adding a specific test for it would be any helpful/useful (unless someone deletes the +1 I added to the loop initialization). Besides, the data that triggered the bug and I was using for testing cannot be disclosed. Cheers, Rodrigo
          Hide
          Giridharan Kesavan added a comment -

          Could you resubmit the patch ?
          When hudson triggered the patch build, jira was down so hudson coudn't pick up the patch from jira..

          Thanks

          Show
          Giridharan Kesavan added a comment - Could you resubmit the patch ? When hudson triggered the patch build, jira was down so hudson coudn't pick up the patch from jira.. Thanks
          Hide
          Rodrigo Schmidt added a comment -

          reopening the issue to resubmit the patch.

          Show
          Rodrigo Schmidt added a comment - reopening the issue to resubmit the patch.
          Hide
          Hudson added a comment -

          Integrated in Hadoop-trunk #767 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/767/)
          . Fixes CBZip2OutputStream data corruption problem. (Rodrigo Schmidt via zshao)

          Show
          Hudson added a comment - Integrated in Hadoop-trunk #767 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/767/ ) . Fixes CBZip2OutputStream data corruption problem. (Rodrigo Schmidt via zshao)
          Hide
          Zheng Shao added a comment -

          Nigel, sorry I should have included the hadoop unit test results in my comments. We did test the patch a lot internally here. Also I ran the hadoop unit test myself.

          The reason for not adding a separate test is just like Rodrigo has said. The data that corrupts the current implementation is about 1MB in size but we cannot disclose it. There is another public data set that breaks the old code, but it is about 20MB in size. I don't think we want to include that big amount of data into the hadoop codebase.

          Also as you can see, the patch is just 2 bytes inside the BZip2 algorithm itself, literally.

          I will definitely be more carefully next time.

          Show
          Zheng Shao added a comment - Nigel, sorry I should have included the hadoop unit test results in my comments. We did test the patch a lot internally here. Also I ran the hadoop unit test myself. The reason for not adding a separate test is just like Rodrigo has said. The data that corrupts the current implementation is about 1MB in size but we cannot disclose it. There is another public data set that breaks the old code, but it is about 20MB in size. I don't think we want to include that big amount of data into the hadoop codebase. Also as you can see, the patch is just 2 bytes inside the BZip2 algorithm itself, literally. I will definitely be more carefully next time.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12401072/HADOOP-5326.2.patch
          against trunk revision 748861.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/21/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12401072/HADOOP-5326.2.patch against trunk revision 748861. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. -1 patch. The patch command could not apply the patch. Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/21/console This message is automatically generated.

            People

            • Assignee:
              Rodrigo Schmidt
              Reporter:
              Rodrigo Schmidt
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development