ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-1520

A txn log record with a corrupt sentinel byte looks like EOF

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Minor Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.3.5
    • Fix Version/s: 3.4.7, 3.5.0
    • Component/s: server
    • Labels:
    • Environment:

      all

      Description

      In Util.readTxnBytes() the sentinel is compared with 0x42 and if it does not match then the record is considered partially written and thus the EOF. However if it is a partial record the sentinel should be 0x00 since that is what the log is initialized with. Any other value would indicate corruption and should throw an IOException rather than indicate EOF. See ZOOKEEPER-1453 for a related issue.

      1. ZOOKEEPER-1520.patch
        11 kB
        Bill Bridge
      2. init.out
        0.4 kB
        Bill Bridge
      3. init.out
        0.4 kB
        Bill Bridge
      4. checkout.out
        85 kB
        Bill Bridge
      5. ant.out
        24 kB
        Bill Bridge

        Activity

        Bill Bridge created issue -
        Hide
        Bill Bridge added a comment -

        This is a simple localized fix with a low probability of causing any problems. I want to do this to learn the process for making ZooKeeper contributions.

        Show
        Bill Bridge added a comment - This is a simple localized fix with a low probability of causing any problems. I want to do this to learn the process for making ZooKeeper contributions.
        Hide
        Flavio Junqueira added a comment -

        Did you mean to upload a patch, Bill? Your proposal sounds reasonable to me.

        Show
        Flavio Junqueira added a comment - Did you mean to upload a patch, Bill? Your proposal sounds reasonable to me.
        Hide
        Bill Bridge added a comment -

        I intend to upload a patch as soon as I can build from an svn checkout, edit in eclipse, run the tests, and create a patch. This is an extremely complex process, and is not documented anywhere that I can find. Fortunately I have a friend that is an old hand at Java development. He is helping me find all the parts that are not in the trunk (e.g. eclipse, svn, ant, ivy, slf4j, log4j, ???), configuring eclipse to successfully compile, and getting the tests to run. At last report the tests were all failing due to some missing or misplaced component or parameter. I suspect our wandering in the dark is going to produce a different development environment than anyone else has. This will likely lead to future problems.

        Did I just not find the detailed description of how to take an empty machine that boots some freshly installed OS and turn it into a ZooKeeper development environment? It should start with which JDK to use (I chose jdk1.6.0), and where to download it.

        Show
        Bill Bridge added a comment - I intend to upload a patch as soon as I can build from an svn checkout, edit in eclipse, run the tests, and create a patch. This is an extremely complex process, and is not documented anywhere that I can find. Fortunately I have a friend that is an old hand at Java development. He is helping me find all the parts that are not in the trunk (e.g. eclipse, svn, ant, ivy, slf4j, log4j, ???), configuring eclipse to successfully compile, and getting the tests to run. At last report the tests were all failing due to some missing or misplaced component or parameter. I suspect our wandering in the dark is going to produce a different development environment than anyone else has. This will likely lead to future problems. Did I just not find the detailed description of how to take an empty machine that boots some freshly installed OS and turn it into a ZooKeeper development environment? It should start with which JDK to use (I chose jdk1.6.0), and where to download it.
        Hide
        Patrick Hunt added a comment -

        "ant test" is usually how we do it.

        Also "ant eclipse" and then open the project in eclipse itself.

        No need to download jars and such (slf/log/etc..) those will be d/l automatically by ant.

        If the "how to contribute" page is not sufficient please let us know: https://cwiki.apache.org/confluence/display/ZOOKEEPER/HowToContribute

        Show
        Patrick Hunt added a comment - "ant test" is usually how we do it. Also "ant eclipse" and then open the project in eclipse itself. No need to download jars and such (slf/log/etc..) those will be d/l automatically by ant. If the "how to contribute" page is not sufficient please let us know: https://cwiki.apache.org/confluence/display/ZOOKEEPER/HowToContribute
        Bill Bridge made changes -
        Field Original Value New Value
        Attachment init.out [ 12539722 ]
        Hide
        Bill Bridge added a comment -

        I have been unable to checkout the trunk and successfully run the tests. I am using a Solaris 10 SPARC machine, and I expect that is part of my problem.

        I eventually get the following error after "ant test" runs for 35 minutes:

        /zookeeper/zookeeper-trunk/build.xml:1240: Execute failed: java.io.IOException: Cannot run program "autoreconf" (in directory "/zookeeper/zookeeper-trunk/src/c"): error=2, No such file or directory

        Here are the commands I enter:

        bash-3.00$ rm -rf zookeeper-trunk/bash-3.00$ svn checkout http://svn.apache.org/repos/asf/zookeeper/trunk/ zookeeper-trunk &>checkout.out
        bash-3.00$ cd zookeeper-trunk/
        bash-3.00$ ant init &>../init.out
        bash-3.00$ ant test &>../ant.out

        I will attach the output files.

        The autoreconf file is really missing. It seems to be related to compiling C code to run natively, which is why I expect it is a SPARC problem.

        Am I doing something wrong or is development on SPARC Solaris really broken?

        I am very much a newbie to large scale Java development such as ZooKeeper. The instructions on the how to contribute page were not sufficient for me. I had previously downloaded zookeeper-3.3.5 and got it running under eclipse on windows. I found out the hard way that the procedure for a checked out version is quite different.

        It would be nice to list the tools that should be installed such as ivy, eclipse, ant, ... and where they can be found for download. It is particularly important to specify which version of the JDK should be used since using a 1.7 feature in ZooKeeper would be bad if users are running a 1.6 JRE. I still do not know which JDK is appropriate.

        A bit more explaination about how the build file works, and what it is capable of would be helpful. I have never used ant before so it took awhile to understand build.xml. A list of the high level targets, their intended use, and what they do would have told me that I did not need to find libraries like log4j online. I had never heard of ivy. Understanding why I should do something is better than just being told what to type.

        Show
        Bill Bridge added a comment - I have been unable to checkout the trunk and successfully run the tests. I am using a Solaris 10 SPARC machine, and I expect that is part of my problem. I eventually get the following error after "ant test" runs for 35 minutes: /zookeeper/zookeeper-trunk/build.xml:1240: Execute failed: java.io.IOException: Cannot run program "autoreconf" (in directory "/zookeeper/zookeeper-trunk/src/c"): error=2, No such file or directory Here are the commands I enter: bash-3.00$ rm -rf zookeeper-trunk/bash-3.00$ svn checkout http://svn.apache.org/repos/asf/zookeeper/trunk/ zookeeper-trunk &>checkout.out bash-3.00$ cd zookeeper-trunk/ bash-3.00$ ant init &>../init.out bash-3.00$ ant test &>../ant.out I will attach the output files. The autoreconf file is really missing. It seems to be related to compiling C code to run natively, which is why I expect it is a SPARC problem. Am I doing something wrong or is development on SPARC Solaris really broken? I am very much a newbie to large scale Java development such as ZooKeeper. The instructions on the how to contribute page were not sufficient for me. I had previously downloaded zookeeper-3.3.5 and got it running under eclipse on windows. I found out the hard way that the procedure for a checked out version is quite different. It would be nice to list the tools that should be installed such as ivy, eclipse, ant, ... and where they can be found for download. It is particularly important to specify which version of the JDK should be used since using a 1.7 feature in ZooKeeper would be bad if users are running a 1.6 JRE. I still do not know which JDK is appropriate. A bit more explaination about how the build file works, and what it is capable of would be helpful. I have never used ant before so it took awhile to understand build.xml. A list of the high level targets, their intended use, and what they do would have told me that I did not need to find libraries like log4j online. I had never heard of ivy. Understanding why I should do something is better than just being told what to type.
        Bill Bridge made changes -
        Attachment init.out [ 12539726 ]
        Bill Bridge made changes -
        Attachment checkout.out [ 12539727 ]
        Attachment ant.out [ 12539728 ]
        Hide
        Bill Bridge added a comment -

        I have attached a patch to throw an error if the sentinel byte is corrupt.Please review.

        Show
        Bill Bridge added a comment - I have attached a patch to throw an error if the sentinel byte is corrupt.Please review.
        Bill Bridge made changes -
        Attachment ZOOKEEPER-1520.patch [ 12541774 ]
        Hide
        Bill Bridge added a comment -

        Throws an error if the sentinel byte on a transaction record is corrupt.

        Show
        Bill Bridge added a comment - Throws an error if the sentinel byte on a transaction record is corrupt.
        Bill Bridge made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Labels newbie newbie patch
        Fix Version/s 3.3.6 [ 12320172 ]
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12541774/ZOOKEEPER-1520.patch
        against trunk revision 1373156.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1165//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1165//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1165//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12541774/ZOOKEEPER-1520.patch against trunk revision 1373156. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1165//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1165//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1165//console This message is automatically generated.
        Hide
        Bill Bridge added a comment -

        I moved to linux since I was unable to get the tests working on SPARC Solaris (See above comments). However that too did not really work. I am using a virtual machine that is assigned to me for Oracle development so the timing may be different than a physical machine.I ran "ant test" many times before making any changes and got many different errors on the tests. They all seem timing related. After creating my patch and testing it I get the same tests failing at about the same rate. For example I got the following errors when I ran the test suite 21 times:

        "server.quorum.Zab1_0Test FAILED" 16 times
        "server.ZxidRolloverTest FAILED" 1 time
        "test.AsyncHammerTest FAILED" 3 times
        "test.FLETest FAILED" 1 time
        "test.NioNettySuiteHammerTest FAILED" 3 times "(timeout)" on 2 of them
        "test.QuorumTest FAILED" 1 time
        "test.WatcherTest FAILED" 10 times

        There was one of the 21 runs where none of the test failed but the build failed due to a missing script in my checked out source tree (/scratch/zookeeper/Workspace/zookeeper-trunk/). The test output for this run ended with:

        create-cppunit-configure:
        [exec] checking for doxygen... /usr/bin/doxygen
        [exec] checking for perl... /usr/bin/perl
        [exec] checking for dot... no
        [exec] configure: error: cannot find install-sh or install.sh in /scratch/zookeeper/Workspace/zookeeper-trunk/src/c /scratch/zookeeper/Workspace/zookeeper-trunk/src/c/.. /scratch/zookeeper/Workspace/zookeeper-trunk/src/c/../..

        BUILD FAILED ...

        This looks like it has something to do with C code, which I did not touch, so I do not think this has anything to do with my patch.

        Is there something that can be done so I can reliably run the tests?

        Show
        Bill Bridge added a comment - I moved to linux since I was unable to get the tests working on SPARC Solaris (See above comments). However that too did not really work. I am using a virtual machine that is assigned to me for Oracle development so the timing may be different than a physical machine.I ran "ant test" many times before making any changes and got many different errors on the tests. They all seem timing related. After creating my patch and testing it I get the same tests failing at about the same rate. For example I got the following errors when I ran the test suite 21 times: "server.quorum.Zab1_0Test FAILED" 16 times "server.ZxidRolloverTest FAILED" 1 time "test.AsyncHammerTest FAILED" 3 times "test.FLETest FAILED" 1 time "test.NioNettySuiteHammerTest FAILED" 3 times "(timeout)" on 2 of them "test.QuorumTest FAILED" 1 time "test.WatcherTest FAILED" 10 times There was one of the 21 runs where none of the test failed but the build failed due to a missing script in my checked out source tree (/scratch/zookeeper/Workspace/zookeeper-trunk/). The test output for this run ended with: create-cppunit-configure: [exec] checking for doxygen... /usr/bin/doxygen [exec] checking for perl... /usr/bin/perl [exec] checking for dot... no [exec] configure: error: cannot find install-sh or install.sh in /scratch/zookeeper/Workspace/zookeeper-trunk/src/c /scratch/zookeeper/Workspace/zookeeper-trunk/src/c/.. /scratch/zookeeper/Workspace/zookeeper-trunk/src/c/../.. BUILD FAILED ... This looks like it has something to do with C code, which I did not touch, so I do not think this has anything to do with my patch. Is there something that can be done so I can reliably run the tests?
        Bill Bridge made changes -
        Attachment ZOOKEEPER-1520.patch [ 12542370 ]
        Hide
        Bill Bridge added a comment -

        Patch did not include test

        Show
        Bill Bridge added a comment - Patch did not include test
        Bill Bridge made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Bill Bridge made changes -
        Attachment ZOOKEEPER-1520.patch [ 12541774 ]
        Hide
        Bill Bridge added a comment -

        This treats a sentinel byte of zero as EOF, and throws an IOException if the sentinel byte is otherwise not 0x42. CRCTest was upgraded to test for all 4 types of corruption that the code looks for, including a corrupt sentinel byte.

        Show
        Bill Bridge added a comment - This treats a sentinel byte of zero as EOF, and throws an IOException if the sentinel byte is otherwise not 0x42. CRCTest was upgraded to test for all 4 types of corruption that the code looks for, including a corrupt sentinel byte.
        Bill Bridge made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12542370/ZOOKEEPER-1520.patch
        against trunk revision 1373156.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1170//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1170//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1170//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12542370/ZOOKEEPER-1520.patch against trunk revision 1373156. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1170//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1170//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1170//console This message is automatically generated.
        Flavio Junqueira made changes -
        Assignee Bill Bridge [ bbridge ]
        Hide
        Flavio Junqueira added a comment -

        The patch looks good to me, thanks, Bill. I have a question about how to recover from such errors. With this patch, if there is a corrupt record, then we throw an exception, which I believe would prevent a server from starting. Does it mean that if we get a corrupt record on a given node, we won't be able to recover the data on the disk of that node?

        Show
        Flavio Junqueira added a comment - The patch looks good to me, thanks, Bill. I have a question about how to recover from such errors. With this patch, if there is a corrupt record, then we throw an exception, which I believe would prevent a server from starting. Does it mean that if we get a corrupt record on a given node, we won't be able to recover the data on the disk of that node?
        Hide
        Flavio Junqueira added a comment -

        Also, the sentinel being zero does not necessarily mean that there has been no corruption. I understand that it is a pretty good guess, though, because of the zeroing during preallocation. If it happens that there has been a corruption and we simply considered it as a partial record. Is it acceptable to have such a low probability corner case? I'm just pointing this out because the behavior wouldn't be consistent if we hit this case.

        Show
        Flavio Junqueira added a comment - Also, the sentinel being zero does not necessarily mean that there has been no corruption. I understand that it is a pretty good guess, though, because of the zeroing during preallocation. If it happens that there has been a corruption and we simply considered it as a partial record. Is it acceptable to have such a low probability corner case? I'm just pointing this out because the behavior wouldn't be consistent if we hit this case.
        Patrick Hunt made changes -
        Fix Version/s 3.4.7 [ 12325149 ]
        Fix Version/s 3.5.0 [ 12316644 ]
        Fix Version/s 3.3.6 [ 12320172 ]
        Hide
        Michi Mutsuzaki added a comment -

        The patch doesn't apply anymore.

        Show
        Michi Mutsuzaki added a comment - The patch doesn't apply anymore.
        Michi Mutsuzaki made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Hide
        Michi Mutsuzaki added a comment -

        Sorry for jumping in this late, but to answer Flavio's question:

        Also, the sentinel being zero does not necessarily mean that there has been no corruption. I understand that it is a pretty good guess, though, because of the zeroing during preallocation. If it happens that there has been a corruption and we simply considered it as a partial record. Is it acceptable to have such a low probability corner case? I'm just pointing this out because the behavior wouldn't be consistent if we hit this case.

        I think we can check if the record is corrupted by validating the checksum, using 0x42 as the last byte instead of 0x0. By the way, it's kind of strange that readTxnBytes() doesn't validate the checksum at all.

        Show
        Michi Mutsuzaki added a comment - Sorry for jumping in this late, but to answer Flavio's question: Also, the sentinel being zero does not necessarily mean that there has been no corruption. I understand that it is a pretty good guess, though, because of the zeroing during preallocation. If it happens that there has been a corruption and we simply considered it as a partial record. Is it acceptable to have such a low probability corner case? I'm just pointing this out because the behavior wouldn't be consistent if we hit this case. I think we can check if the record is corrupted by validating the checksum, using 0x42 as the last byte instead of 0x0. By the way, it's kind of strange that readTxnBytes() doesn't validate the checksum at all.

          People

          • Assignee:
            Bill Bridge
            Reporter:
            Bill Bridge
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:

              Time Tracking

              Estimated:
              Original Estimate - 24h
              24h
              Remaining:
              Remaining Estimate - 24h
              24h
              Logged:
              Time Spent - Not Specified
              Not Specified

                Development