Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      Currently, the end of the SequenceFile header is a sync block that isn't prefaced with SYNC_ESCAPE. This means that sync(0) followed by next() fails.
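
      For illustration, a minimal sketch of the failing sequence (not the attached test; the path and the key/value types here are arbitrary choices):

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.io.IntWritable;
          import org.apache.hadoop.io.SequenceFile;
          import org.apache.hadoop.io.Text;

          // Write a small SequenceFile, then call sync(0) followed by next().
          // With the unpatched header, the reader can misread the header's sync
          // block and throw (e.g. EOFException at readInt) or return garbage
          // instead of the first record.
          public class SyncZeroRepro {
            public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              FileSystem fs = FileSystem.getLocal(conf);
              Path file = new Path("/tmp/sync-zero-repro.seq");   // arbitrary local path

              SequenceFile.Writer writer =
                  SequenceFile.createWriter(fs, conf, file, IntWritable.class, Text.class);
              for (int i = 0; i < 10; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
              }
              writer.close();

              SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
              reader.sync(0);                        // position at the first sync point at/after 0
              IntWritable key = new IntWritable();
              Text value = new Text();
              boolean gotRecord = reader.next(key, value);   // should yield record 0
              System.out.println("next() returned " + gotRecord + ": " + key + " / " + value);
              reader.close();
            }
          }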

      Attachments

      1. C6196-0.patch
        5 kB
        Chris Douglas
      2. C6196-test.patch
        10 kB
        Chris Douglas
      3. hadoop-6196-multipatch.txt
        6 kB
        Jay Booth
      4. sync-bug.patch
        5 kB
        Jay Booth

        Activity

        Jay Booth added a comment -

        Here's the patch. To reproduce the problem, apply the patch, then revert SequenceFile (to undo the fix) and run the attached test, TestSequenceFileSync.

        Strangely, it works sometimes and breaks other times. When it works, it tends to work a few times in a row until a few seconds pass, and then it starts breaking again (EOFException at readInt). I don't get it; it could be something weird with my environment, but with the patch to SequenceFile it works all the time, so who cares.

        The fix required bumping the SequenceFile version from 6 to 7, writing the SYNC_ESCAPE int properly before the sync block at the end of the header, and then scanning past it before loading the sync marker when opening. Now sync(0); next(); works properly.
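
        As a rough sketch of that header change (a hypothetical helper, not the actual patch; `out` and `sync` stand in for the Writer's output stream and 16-byte sync marker):

            // Hypothetical helper: prefix the header's trailing sync block with
            // SYNC_ESCAPE (-1), exactly as record-level sync blocks are written.
            private void writeHeaderSync(java.io.DataOutputStream out, byte[] sync)
                throws java.io.IOException {
              out.writeInt(-1 /* SYNC_ESCAPE */);   // escape word preceding the marker
              out.write(sync);                      // 16 random bytes generated per Writer
            }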

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12416705/sync-bug.patch
        against trunk revision 804352.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/608/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/608/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/608/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/608/console

        This message is automatically generated.

        Konstantin Boudnik added a comment -

        A few comments:

        • The new unit test is developed under JUnit 3 conventions (e.g., it extends TestCase, imports junit.framework, etc.). I'd suggest making it JUnit 4 compliant: remove the TestCase extension, add a @Test annotation to the test method, and use org.junit.* packages instead (a sketch follows this list).
        • It seems that the unit test doesn't cover all the cases described in the comment above:

          Strangely, it works sometimes and breaks other times. When it works, it'll tend to work for a few times in a row until a few seconds pass and then it starts breaking again (EOFException at readInt)

          It seems more sensible to do a number of seek(0); next() calls followed by random pauses and seek() calls to locations other than 0.

        • SequenceFile.java has some whitespace-only modifications, e.g.
          -      if (!blockCompressed) {
          +      if (!blockCompressed) {        
          
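        A hypothetical JUnit 4 skeleton for the test, per the first suggestion in this list:

            import org.junit.Test;                  // instead of junit.framework.*

            public class TestSequenceFileSync {     // no TestCase superclass
              @Test                                 // annotation marks the test method
              public void testLowSyncFollowedByNext() throws Exception {
                // body as in the attached test: write a SequenceFile, sync(0),
                // next(), and verify the result
              }
            }
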
        Chris Douglas added a comment -

        Cancelling patch until Cos's comments are addressed

        Jay Booth added a comment -

        Hey, will address this weekend, sorry for the slow turnaround

        Jay Booth added a comment -

        So, it turns out that SequenceFile.Writer generates its sync block and headers statically at classloader time, so if it generates the 'right' binary data, it'll evade this bug. I can't reproduce the bug reliably without running the test multiple times in a row by hand, unless someone knows a way to re-execute static code in a running JVM. It occurs about 2/3 of the time or so. Modifying SequenceFile to expose those values for writing by a test seems like overkill... should we just leave it as is? If the build runs frequently enough, it'll catch regressions eventually, right? All the other ideas I have are hacky.

        Issues with JUnit 4 and whitespace in the patch have been addressed.

        Jay Booth added a comment -

        I had an idea for the issue of needing a JVM restart to duplicate the bug reliably: if I define 5 test cases calling a common method in the same file, it'll launch in a new JVM each time under Ant, right? I didn't see a way to run just one test under Ant, and the tests take all day to run on my machine; can anyone confirm?

        Anyway, here's a new patch that attempts to run the test 5 times in different JVM instances.

        Jay Booth added a comment -

        (Sorry for the spam.) In Eclipse's JUnit runner it seems to work: the test will pass 2-3 times and fail 2-3 times. If we're extremely lucky/unlucky, it'll squeeze out a 5-for-5 success, but that's pretty rare.

        Chris Douglas added a comment -

        So, it turns out that SequenceFile.Writer generates its sync block and headers statically at classloader time, so if it generates the 'right' binary data, it'll evade this bug.

        The unit test doesn't validate the data it reads; examining the records read, when SequenceFile doesn't fail after a call to sync(0), it returns invalid records or skips records. It doesn't look like it evades the bug so much as it fails silently. The fix to the header will work, but we must also fix the Reader and backport that fix.

        The random behavior comes from the sync marker; after a sync(long), the reader backs up to the start of the marker and reads it as the record length. Since it backs up to the word before the marker, it reads the last word written by the metadata (00 00 00 00 if the metadata are empty) as the record length, then the first bytes of the sync marker as the key. Whether this causes OOM, EOF, or no exception is as random as the sync marker and depends on the key type. I've attached some trace debugging to illustrate this (and a modified version of Jay's test).

        Since this can cause silent failures and data loss, and because the SequenceFile format has been stable for over two years, I don't know that an incompatible change fixing this only for future SequenceFiles is practical. In discussing this with Arun, the solution that seemed most appropriate and feasible was to record the end of the header in init() and adjust sync(long) calls within that range to the first record boundary. Tentatively, setting syncSeen to true when syncing to within this range may also be necessary, but we'll need to study it.
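
        A sketch of that Reader-side adjustment (a fragment of SequenceFile.Reader; field and variable names here are illustrative, not necessarily those in the attached patch):

            // Record where the header ends during init(), then treat any
            // sync(position) inside the header as a sync to the first record
            // boundary instead of scanning from inside the header bytes.
            private long headerEnd;                  // set at the end of init()

            public synchronized void sync(long position) throws IOException {
              if (position + SYNC_SIZE >= end) {     // existing end-of-file behavior
                seek(end);
                return;
              }
              if (position < headerEnd) {
                seek(headerEnd);                     // first record starts right after the header
                syncSeen = true;                     // tentatively, per the discussion above
                return;
              }
              // ... existing scan for the next sync marker past `position` ...
            }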

        Very good catch, Jay

        Chris Douglas added a comment -

        Attaching a patch with a tentative fix and test based on Jay's original

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12418217/C6196-0.patch
        against trunk revision 809491.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 2 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/3/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/3/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/3/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/3/console

        This message is automatically generated.

        Jay Booth added a comment -

        Hey Chris, thanks, that looks great; good idea to avoid the version change. I'm pretty sure setting syncSeen=true is necessary, but it looks like that's in your patch. Thanks for the help.

        Jay Booth added a comment -

        Sorry, I didn't realize I had to resolve this on my end. Will this be fine for 0.21.0 now?

        Chris Douglas added a comment -

        No patch fixing this issue has been integrated into 0.21 yet; a committer needs to review and approve the patch before it can be pushed to the repository and we can resolve the issue as "fixed." I'm reopening the issue and marking it "patch available" again.

        Considering the pain of writing incompatible SequenceFiles, I'm leaning away from bumping the version and just going with the compatible change to the reader. If the modified Reader approach seems hackish or I'm overestimating the pain of bumping the version, I'm open to the alternate strategy, but unless we have formats that call sync(0) regularly it seems like a reasonable workaround.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12418217/C6196-0.patch
        against trunk revision 811572.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 2 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/19/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/19/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/19/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/19/console

        This message is automatically generated.

        Jay Booth added a comment -

        I agree, your fix seems relatively foolproof and sidesteps the version bump entirely. I've tested it pretty thoroughly in pseudo-distributed mode and will be giving it a full workout over the next few days, in case that helps with anyone's confidence. I'd really like to see this go into 0.21 if possible; even if sync(0) isn't very common, the class should still adhere to its contract.

        Arun C Murthy added a comment -

        +1

        Chris Douglas added a comment -

        I committed this. Thanks, Jay!

        Hudson added a comment -

        Integrated in Hadoop-Common-trunk-Commit #27 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk-Commit/27/)
        . Fix a bug in SequenceFile.Reader where syncing within the header would cause the reader to read the sync marker as a record. Contributed by Jay Booth

        Hudson added a comment -

        Integrated in Hadoop-Common-trunk #92 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/92/)
        . Fix a bug in SequenceFile.Reader where syncing within the header would cause the reader to read the sync marker as a record. Contributed by Jay Booth

        Robert Chansler added a comment -

        Editorial pass over all release notes prior to publication of 0.21. Just a bug.


          People

          • Assignee: Jay Booth
          • Reporter: Jay Booth
          • Votes: 0
          • Watchers: 6
