ZOOKEEPER-1621: ZooKeeper does not recover from crash when disk was full

    Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.4.3
    • Fix Version/s: 3.5.4, 3.6.0
    • Component/s: server
    • Labels: None
    • Environment: Ubuntu 12.04, Amazon EC2 instance

      Description

      The disk that ZooKeeper was using filled up. During a snapshot write, I got the following exception:

      2013-01-16 03:11:14,098 - ERROR [SyncThread:0:SyncRequestProcessor@151] - Severe unrecoverable error, exiting
      java.io.IOException: No space left on device
      at java.io.FileOutputStream.writeBytes(Native Method)
      at java.io.FileOutputStream.write(FileOutputStream.java:282)
      at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
      at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
      at org.apache.zookeeper.server.persistence.FileTxnLog.commit(FileTxnLog.java:309)
      at org.apache.zookeeper.server.persistence.FileTxnSnapLog.commit(FileTxnSnapLog.java:306)
      at org.apache.zookeeper.server.ZKDatabase.commit(ZKDatabase.java:484)
      at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:162)
      at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)

      Then many subsequent exceptions like:

      2013-01-16 15:02:23,984 - ERROR [main:Util@239] - Last transaction was partial.
      2013-01-16 15:02:23,985 - ERROR [main:ZooKeeperServerMain@63] - Unexpected exception, exiting abnormally
      java.io.EOFException
      at java.io.DataInputStream.readInt(DataInputStream.java:375)
      at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
      at org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:64)
      at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:558)
      at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:577)
      at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:543)
      at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:625)
      at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:529)
      at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.<init>(FileTxnLog.java:504)
      at org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:341)
      at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:130)
      at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
      at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:259)
      at org.apache.zookeeper.server.ZooKeeperServer.startdata(ZooKeeperServer.java:386)
      at org.apache.zookeeper.server.NIOServerCnxnFactory.startup(NIOServerCnxnFactory.java:138)
      at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:112)
      at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:86)
      at org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:52)
      at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116)
      at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)

      It seems to me that writing the transaction log should be fully atomic to avoid such situations. Is this not the case?
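
      For context on the EOFException above: a txn log's header is only 16 bytes (a magic int, a version int, and a dbid long). Below is a minimal illustration of why a header truncated by a full disk fails on restart, using a simplified standalone reader rather than the actual FileHeader code:

      import java.io.DataInputStream;
      import java.io.FileInputStream;
      import java.io.IOException;

      // Simplified sketch: if the disk fills before all 16 header bytes reach
      // the file, the first readInt() hits end-of-file and throws the
      // EOFException seen in the restart loop above.
      static void readTxnLogHeader(String logFile) throws IOException {
          try (DataInputStream in = new DataInputStream(new FileInputStream(logFile))) {
              int magic = in.readInt();    // 4 bytes ("ZKLG")
              int version = in.readInt();  // 4 bytes
              long dbid = in.readLong();   // 8 bytes
          }
      }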

      1. zookeeper.log.gz
        129 kB
        David Arthur
      2. ZOOKEEPER-1621.2.patch
        9 kB
        Abhishek Rai
      3. ZOOKEEPER-1621.patch
        8 kB
        Michi Mutsuzaki

        Activity

        mumrah David Arthur added a comment -

        I was able to work around the issue by deleting the partially written snapshot file.

        fpj Flavio Junqueira added a comment -

        I believe the exception is being thrown while reading the snapshot, and the partial-transaction message is not an indication of what is causing the crash. Trying a different snapshot sounds right, but according to the log messages you posted, the problem is that we are not catching EOFException.

        mahadev Mahadev konar added a comment -

        David,
        So these exceptions are thrown while ZooKeeper is running? I'm not sure why it's exiting so many times. Do you restart the ZK server if it dies?

        mumrah David Arthur added a comment -

        We run ZooKeeper with runit, so yes it is restarted when it dies. It ends up in a loop of:

        • No space left on device
        • Starting server
        • Last transaction was partial
        • Snapshotting: 0x19a3d to /opt/zookeeper-3.4.3/data/version-2/snapshot.19a3d
        • No space left on device
        mahadev Mahadev konar added a comment -

        David,
        I thought you said it does not recover when the disk was full, but it looks like the disk is still full. No?

        mumrah David Arthur added a comment -

        Here is the full sequence of events (sorry for the confusion):

        • Noticed disk was full
        • Cleaned up disk space
        • Tried zkCli.sh, got errors
        • Checked ZK log, loop of:

        2013-01-16 15:01:35,194 - ERROR [main:Util@239] - Last transaction was partial.
        2013-01-16 15:01:35,196 - ERROR [main:ZooKeeperServerMain@63] - Unexpected exception, exiting abnormally
        java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:375)
        at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
        at org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:64)
        at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:558)
        at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:577)
        at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:543)
        at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:625)
        at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:529)
        at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.<init>(FileTxnLog.java:504)
        at org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:341)
        at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:130)
        at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
        at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:259)
        at org.apache.zookeeper.server.ZooKeeperServer.startdata(ZooKeeperServer.java:386)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.startup(NIOServerCnxnFactory.java:138)
        at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:112)
        at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:86)
        at org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:52)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)

        • Stopped ZK
        • Listed ZK data directory

        ubuntu@ip-10-78-19-254:/opt/zookeeper-3.4.3/data/version-2$ ls -lat
        total 18096
        drwxr-xr-x 2 zookeeper zookeeper 4096 Jan 16 06:41 .
        -rw-r--r-- 1 zookeeper zookeeper        0 Jan 16 06:41 log.19a3e
        -rw-r--r-- 1 zookeeper zookeeper   585377 Jan 16 06:41 snapshot.19a3d
        -rw-r--r-- 1 zookeeper zookeeper 67108880 Jan 16 03:11 log.19a2a
        -rw-r--r-- 1 zookeeper zookeeper   585911 Jan 16 03:11 snapshot.19a29
        -rw-r--r-- 1 zookeeper zookeeper 67108880 Jan 16 03:11 log.11549
        -rw-r--r-- 1 zookeeper zookeeper   585190 Jan 15 17:28 snapshot.11547
        -rw-r--r-- 1 zookeeper zookeeper 67108880 Jan 15 17:28 log.1
        -rw-r--r-- 1 zookeeper zookeeper      296 Jan 14 16:44 snapshot.0
        drwxr-xr-x 3 zookeeper zookeeper 4096 Jan 14 16:44 ..

        • Removed log.19a3e and snapshot.19a3d

        ubuntu@ip-10-78-19-254:/opt/zookeeper-3.4.3/data/version-2$ sudo rm log.19a3e
        ubuntu@ip-10-78-19-254:/opt/zookeeper-3.4.3/data/version-2$ sudo rm snapshot.19a3d

        • Started ZK
        • Back to normal
        mumrah David Arthur added a comment -

        Attaching zookeeper.log

        eribeiro Edward Ribeiro added a comment -

        Hi folks,

        FYI, this issue is a duplicate of ZOOKEEPER-1612 (curiously, a permutation of the last two digits, heh). I'd suggest closing 1612 as the dup instead, if possible.

        mahadev Mahadev konar added a comment -

        I'll mark 1612 as a dup. Thanks for pointing that out, Edward.

        mahadev Mahadev konar added a comment -

        Looks like the header was incomplete. Unfortunately we do not handle a corrupt header, though we do handle corrupt txns later in the log. I'm surprised that this happened twice in a row for two users. I'll upload a patch and test case.

        michim Michi Mutsuzaki added a comment -

        Should FileTxnIterator.goToNextLog() return false if the header is corrupted/incomplete, or should it skip the log file and go to the next log file if it exists?
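
        One possible shape of the "skip" option, as a standalone sketch (the real logic lives in FileTxnIterator.goToNextLog(); the helper below is illustrative only, not the attached patch):

        import java.io.DataInputStream;
        import java.io.EOFException;
        import java.io.File;
        import java.io.FileInputStream;
        import java.io.IOException;
        import java.util.List;

        // Walk candidate log files newest-first, skipping any whose 16-byte
        // header is truncated; returning null is the analogue of goToNextLog()
        // returning false once no readable log remains.
        static File firstReadableLog(List<File> logsNewestFirst) throws IOException {
            for (File f : logsNewestFirst) {
                try (DataInputStream in = new DataInputStream(new FileInputStream(f))) {
                    in.readInt();  // magic
                    in.readInt();  // version
                    in.readLong(); // dbid
                    return f;      // complete header: use this log
                } catch (EOFException e) {
                    System.err.println("Incomplete header in " + f + ", skipping");
                }
            }
            return null;
        }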

        michim Michi Mutsuzaki added a comment -

        https://reviews.apache.org/r/21732/

        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12645856/ZOOKEEPER-1621.patch
        against trunk revision 1596284.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2105//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2105//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2105//console

        This message is automatically generated.

        shralex Alexander Shraer added a comment -

        Here's a different option. Intuitively, once ZooKeeper fails to write to disk, by continuing to operate normally it violates its promise to users (that if a majority acked, the data is always there, even across reboots). Once we realize the promise can't be kept, it may be better to crash the server at that point and violate liveness (no availability) rather than continue and risk coming up with a partial log at a later point, violating safety (inconsistent state, lost transactions, etc.).

        michim Michi Mutsuzaki added a comment -

        I'm fine with Alex's suggestion. We should document how to manually recover when the server doesn't start because the log file doesn't contain the complete header.

        mumrah David Arthur added a comment -

        I actually like Alexander Shraer's suggestion. However, if this is going to be the recommended way to recover a corrupt log file, there should be a script that does it for users: zk-recover.sh or some such. In this world of deployment automation, it's not nice to tell users "go delete the most recent log segment from ZK's data dir". Much better for the application to handle it through a script or command.

        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12645856/ZOOKEEPER-1621.patch
        against trunk revision 1662055.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2538//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2538//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2538//console

        This message is automatically generated.

        arshad.mohammad Mohammad Arshad added a comment -

        Apart from these corrective measures, there should be some preventive measures as well.
        Can we have a disk space availability checker that periodically checks whether disk space is available and, if not, shuts ZooKeeper down gracefully?

        rgs Raul Gutierrez Segales added a comment -

        You mean, like a ZK thread dedicated to this? What would the behavior be, only shut down if it's the leader?

        arshad.mohammad Mohammad Arshad added a comment -
        • Yes, a dedicated thread for this, like org.apache.zookeeper.server.DatadirCleanupManager
        • Shut down in every case, because without disk space ZooKeeper cannot serve any purpose
        • The idea is as follows (a rough sketch appears after this list)
          • Add two new ZooKeeper properties:
            diskspace.min.threshold=5% (values can be a % of the data directory's available space, or in GB)
            diskspace.check.interval=5 seconds (default: 5, min: 1, max: Long.MAX_VALUE)
          • Add a dedicated disk check thread
            • which runs every diskspace.check.interval seconds
            • if disk space is less than diskspace.min.threshold, shut down the ZooKeeper instance
        • Some clarifications:
          • Query: Suppose diskspace.check.interval=5 and the disk can be filled within 5 seconds, by ZooKeeper or another process. How is this handled?
            Ans: Users should know their usage scenario and which other processes share the disk, and optimize the diskspace.check.interval value accordingly
          • Query: Say diskspace.check.interval=1, but the disk can be filled even within 1 second by ZooKeeper and other processes
            Ans: Yes, it can be filled if diskspace.min.threshold is low; again, users need to optimize diskspace.min.threshold based on their disk usage
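
        A self-contained sketch of such a checker (the property names above are only proposed, not existing config; the System.exit call stands in for a graceful ZooKeeper shutdown):

        import java.io.File;
        import java.util.concurrent.Executors;
        import java.util.concurrent.ScheduledExecutorService;
        import java.util.concurrent.TimeUnit;

        public class DiskSpaceChecker {
            // Periodically compare the data directory's usable space against a
            // threshold, mirroring diskspace.min.threshold / diskspace.check.interval.
            public static void start(File dataDir, long minFreeBytes, long intervalSeconds) {
                ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
                ses.scheduleAtFixedRate(() -> {
                    if (dataDir.getUsableSpace() < minFreeBytes) {
                        System.err.println("Free disk space below threshold; shutting down");
                        System.exit(1); // stand-in for a graceful shutdown
                    }
                }, 0, intervalSeconds, TimeUnit.SECONDS);
            }
        }
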
        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12645856/ZOOKEEPER-1621.patch
        against trunk revision 1697227.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2834//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2834//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2834//console

        This message is automatically generated.

        arshad.mohammad Mohammad Arshad added a comment -

        Hi Raul Gutierrez Segales, does this make sense? Can we create a new JIRA for this?

        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12645856/ZOOKEEPER-1621.patch
        against trunk revision 1748630.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to cause Findbugs (version 2.0.3) to fail.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3215//testReport/
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3215//console

        This message is automatically generated.

        abhishekrai Abhishek Rai added a comment -

        Reviving this old thread. Alexander Shraer has a valid concern about trading off consistency for availability. However, for the specific issue being addressed here, we can have both.

        The patch skips transaction logs with an incomplete header (the first 16 bytes). Skipping such files should not cause any loss of data as the header is an internal bookkeeping write from Zookeeper and does not contain any user data. This avoids the current behavior of Zookeeper crashing on encountering an incomplete header, which compromises availability.

        This has been a recurring problem for us in production because our app's operating environment occasionally causes a Zookeeper server's disk to become full. After that, the server invariably runs into this problem - perhaps because there's something else that deterministically triggers a log rotation when the previous txn log throws an IOException due to disk full?

        That said, we can tighten the exception being caught in Michi Mutsuzaki's patch to EOFException instead of IOException to make sure that the log we are skipping indeed only has a partially written header and nothing else (in FileTxnLog.goToNextLog).

        Additionally, I have written a test to verify that EOFException is thrown if and only if the header is truncated. Zookeeper already ignores any other partially written transactions in the txn log. If that's useful, I can upload the test, thanks.
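
        For reference, a hedged sketch of what such a test could look like (JUnit 4 style, simplified to raw stream reads rather than the FileTxnLog API; this is not the attached patch):

        import java.io.DataInputStream;
        import java.io.EOFException;
        import java.io.File;
        import java.io.FileInputStream;
        import java.io.FileOutputStream;
        import java.io.IOException;
        import org.junit.Test;

        public class TruncatedHeaderTest {
            @Test(expected = EOFException.class)
            public void truncatedHeaderThrowsEof() throws IOException {
                File log = File.createTempFile("log.", ".tmp");
                log.deleteOnExit();
                try (FileOutputStream out = new FileOutputStream(log)) {
                    out.write(new byte[3]); // fewer than the 16 header bytes
                }
                try (DataInputStream in = new DataInputStream(new FileInputStream(log))) {
                    in.readInt();  // magic: throws EOFException on a 3-byte file
                    in.readInt();  // version
                    in.readLong(); // dbid
                }
            }
        }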

        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12645856/ZOOKEEPER-1621.patch
        against trunk revision df5519ab9dac9940f35cc4b308b560f2603aec7f.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        -1 javac. The patch appears to cause tar ant target to fail.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3476//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3476//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3476//console

        This message is automatically generated.

        mkizner Meyer Kizner added a comment -

        Agreed. Forcing users to manually clean up the partial/empty header in this scenario seems undesirable, and if we only catch EOFException instead of IOException, we shouldn't run into any problems with correctness. Additionally, since this issue should only occur "legitimately" in the most recent txn log file, we can be even more conservative and only continue in that case.

        abhishekrai Abhishek Rai added a comment -

        Thanks Meyer Kizner. Your suggestion of doing this only for the most recent txn log file is sound. Are you also suggesting that we delete this truncated txn log file?

        Because if we skip it but don't delete it, newer txn log files will get created in the future, so the truncated txn log file will no longer be the latest txn log when we do a purge afterwards.

        Deletion seems consistent with this approach as well as consistent with PurgeTxnLog's behavior.

        mkizner Meyer Kizner added a comment -

        Yes, we would have to delete such a log file upon encountering it. I don't believe this would cause any problems, and it seems desirable to have the extra check this enables.

        hanm Michael Han added a comment -

        The proposal of the fix makes sense to me.

        Is it feasible to make a stronger guarantee for the ZooKeeper serialization semantics - that is, under no circumstances (disk full, power failure, hardware failure) would ZooKeeper generate invalid persistent files (for both snapshots and txn logs)? This might be possible by serializing to a swap file first and then doing an atomic rename of the file at some point (a rough sketch appears after the lists below). With the sanity of the on-disk format guaranteed, the deserialization logic would be simplified, as there would not be many corner cases to consider beyond the existing basic checksum check.

        I can think of two potential drawbacks of this approach:

        • Performance: if we write to a swap file and then rename for every write, we will be making more syscalls per write, which might impact write performance / latency.
        • Potential data loss during recovery: to improve performance, we could batch writes and only rename at certain points (e.g. every 1000 writes). In case of a failure, part of the data might be lost, as data (possibly corrupted / partially serialized) still living in the swap file will not be parsed by ZK during startup (we would only load and parse renamed files).

        My feeling is that the best approach might mix efforts on both the serialization and deserialization sides:

        • When serializing, we do our best to avoid generating corrupted files (e.g. through atomic file writes).
        • When deserializing, we do our best to detect corrupt files and recover conservatively. The success of recovery may be case by case: for this disk-full case the proposed fix sounds safe to perform, while in other cases it might not be straightforward to tell which data is good and which is bad.
        • As a result, the expectation is that when things crash and files are corrupted, ZK should be able to recover later without manual intervention. This would be good for users.
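
        A minimal sketch of the write-then-atomic-rename idea using java.nio (the method name and snapshotBytes parameter are placeholders; the current FileSnap/FileTxnLog code does not work this way):

        import java.io.IOException;
        import java.io.OutputStream;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.nio.file.StandardCopyOption;

        // Write the whole file under a temp name, then atomically rename it
        // into place, so readers can never observe a partially written file.
        static void atomicWrite(Path dir, String name, byte[] snapshotBytes) throws IOException {
            Path tmp = Files.createTempFile(dir, name, ".tmp"); // same filesystem as the target
            try (OutputStream out = Files.newOutputStream(tmp)) {
                out.write(snapshotBytes);
                // a real implementation would also fsync the file and directory here
            }
            Files.move(tmp, dir.resolve(name), StandardCopyOption.ATOMIC_MOVE);
        }
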
        abrahamfine Abraham Fine added a comment -

        Michael Han, I do not see an issue with the generation of invalid log files, as long as no data is lost and the system knows how to handle them without user intervention, especially if preventing this would have an impact on performance.

        "while in other cases it might not be straightforward to tell which data is good and which is bad"

        Would you mind explaining what cases you are referring to?

        abhishekrai Abhishek Rai added a comment -

        Based on the discussion with Meyer Kizner above, skipping of the truncated txn log file is insufficient, and its deletion is necessary. Otherwise we can run into problems in two places:

        • FileTxnLog is required to include the latest txn log before the snapshot that it's loading. If that latest txn log is truncated (and previously skipped), then it can incorrectly satisfy this requirement. Instead, if we delete the truncated file, then we are forced to reach back into the older valid txn log.
        • PurgeTxnLog has similar logic about retaining the latest txn log before the last retained snapshot. Therefore, without the deletion, its requirements would similarly be met by a truncated and useless txn log.

        I've now updated Michi Mutsuzaki's patch with two changes (a rough sketch follows this list) and corresponding testing changes:

        • Deletion as described here.
        • Use a tighter exception (EOFException) instead of IOException.
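
        A hedged sketch of that deletion behavior (the helper name is illustrative and this is not the attached patch; FileHeader and BinaryInputArchive are the classes from the stack traces above):

        import java.io.EOFException;
        import java.io.File;
        import java.io.FileInputStream;
        import java.io.IOException;
        import org.apache.jute.BinaryInputArchive;
        import org.apache.zookeeper.server.persistence.FileHeader;

        // If the most recent txn log's header is truncated, delete the file so
        // neither FileTxnLog nor PurgeTxnLog can treat it as the "latest" log.
        static void dropIfTruncated(File lastLog) throws IOException {
            try (FileInputStream in = new FileInputStream(lastLog)) {
                new FileHeader().deserialize(BinaryInputArchive.getArchive(in), "fileheader");
                // header read fine: keep the file
            } catch (EOFException e) {
                // fewer than 16 header bytes were written, so no user txns exist
                if (!lastLog.delete()) {
                    throw new IOException("Failed to delete truncated log " + lastLog);
                }
            }
        }
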
        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12837140/ZOOKEEPER-1621.2.patch
        against trunk revision bcb07a09b06c91243ed244f04a71b8daf629e286.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 19 new Findbugs (version 3.0.1) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3513//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3513//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3513//console

        This message is automatically generated.

        jgrassler Johannes Grassler added a comment -

        This has been open and unchanged for quite a while now, and the existing patch targets 3.5. Has there been any progress on fixing this in the 3.4 branch? I maintain a ZooKeeper 3.4.x package for openSUSE, and if there is a fix that targets 3.4.x I'd like to include it.

        jeffwidman Jeff Widman added a comment -

        Any update on this?

        The fix version says 3.5.4, but it looks like the patch hasn't been merged yet... despite (as best I can tell from the comments) consensus that it is an improvement over the current behavior.


          People

          • Assignee: michim Michi Mutsuzaki
          • Reporter: mumrah David Arthur
          • Votes: 6
          • Watchers: 25
