ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-1427

Writing to local files is done non-atomically

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 3.4.3
    • Fix Version/s: 3.4.4, 3.5.0
    • Component/s: server
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Currently, the writeLongToFile() function opens the file for truncate, writes the new data, syncs, and then closes. If the process crashes after opening the file but before writing the new data, the file may be left empty, causing ZK to "forget" an earlier promise. Instead, it should use RandomAccessFile to avoid truncating.

      1. ZOOKEEPER-1427.patch
        20 kB
        Patrick Hunt
      2. ZOOKEEPER-1427.patch
        20 kB
        Patrick Hunt
      3. ZOOKEEPER-1427.patch
        21 kB
        Patrick Hunt
      4. ZOOKEEPER-1427_br34.patch
        20 kB
        Patrick Hunt
      5. ZOOKEEPER-1427_br34.patch
        20 kB
        Patrick Hunt
      6. ZOOKEEPER-1427_br34.patch
        21 kB
        Patrick Hunt

        Activity

        Hide
        Patrick Hunt added a comment -

        Comment from the ML that I believe is related to this issue:

        I am currently facing an issue with zookeeper restart.  Version used is
        3.4.3
        
        Following is the scenario.
        
        While setting the currentEpoch or acceptedEpoch to file, the zookeeper
        process was abruptly killed.
        
        This lead to an empty file being left under the dataDir (acceptedEpoch or
        currentEpoch).
        
        On zookeeper restart it fails indicating that null was found in the file.
        
        Show
        Patrick Hunt added a comment - Comment from the ML that I believe is related to this issue: I am currently facing an issue with zookeeper restart. Version used is 3.4.3 Following is the scenario. While setting the currentEpoch or acceptedEpoch to file, the zookeeper process was abruptly killed. This lead to an empty file being left under the dataDir (acceptedEpoch or currentEpoch). On zookeeper restart it fails indicating that null was found in the file.
        Hide
        Patrick Hunt added a comment -

        FYI: Simple workaround is to remove the acceptedEpoch/currentEpoch files and restart the server in question.

        Show
        Patrick Hunt added a comment - FYI: Simple workaround is to remove the acceptedEpoch/currentEpoch files and restart the server in question.
        Show
        Todd Lipcon added a comment - Alternatively, https://github.com/toddlipcon/hadoop-common/blob/qjm-patchseries/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/AtomicFileOutputStream.java might be useful for you.
        Hide
        Patrick Hunt added a comment -

        Hi Todd, I don't think the original suggestion will work (use RAF) - the length of the file may change, in which case you would end up with the possibility of "1000" and then "999" resulting in "9990" (in the error case, ie process failure during update).

        I think we need to use rename for this, that sound right to you? Create the new file as "foo.tmp" and then rename to "foo" once the file is fully created.

        Show
        Patrick Hunt added a comment - Hi Todd, I don't think the original suggestion will work (use RAF) - the length of the file may change, in which case you would end up with the possibility of "1000" and then "999" resulting in "9990" (in the error case, ie process failure during update). I think we need to use rename for this, that sound right to you? Create the new file as "foo.tmp" and then rename to "foo" once the file is fully created.
        Hide
        Patrick Hunt added a comment -

        FYI, I was writing my most recent comment while yours was in-flight.

        Seems we are thinking the same thing.

        Show
        Patrick Hunt added a comment - FYI, I was writing my most recent comment while yours was in-flight. Seems we are thinking the same thing.
        Hide
        Todd Lipcon added a comment -

        Yea, I didn't know whether the data was fixed-length (and <512bytes). Write, fsync, and rename is best. In theory we should fsync the directory, too, but I don't know of any way to do it from Java

        Show
        Todd Lipcon added a comment - Yea, I didn't know whether the data was fixed-length (and <512bytes). Write, fsync, and rename is best. In theory we should fsync the directory, too, but I don't know of any way to do it from Java
        Hide
        Patrick Hunt added a comment -

        I took Todd's suggestion and pulled in AtomicFileOutputStream and it's corresponding test. (thanks HDFS!)

        Take a look at the error handling in QuorumPeer in particular.

        Todd if you could take a look I'd appreciate it. Notice that I added tests for abort, feel free to pull those back.

        Show
        Patrick Hunt added a comment - I took Todd's suggestion and pulled in AtomicFileOutputStream and it's corresponding test. (thanks HDFS!) Take a look at the error handling in QuorumPeer in particular. Todd if you could take a look I'd appreciate it. Notice that I added tests for abort, feel free to pull those back.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12534071/ZOOKEEPER-1427.patch
        against trunk revision 1355541.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 5 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1115//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1115//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1115//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12534071/ZOOKEEPER-1427.patch against trunk revision 1355541. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 5 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1115//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1115//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1115//console This message is automatically generated.
        Hide
        Patrick Hunt added a comment -

        Basically the same as the last patch except:

        1) Missed re-throwing the exception in the QP catch block, fixed that.
        2) added this method to the findbugs exclude - the logic is complicated enough that it's confusing fb (please double check me on this as I'm turning off the safety)

        Show
        Patrick Hunt added a comment - Basically the same as the last patch except: 1) Missed re-throwing the exception in the QP catch block, fixed that. 2) added this method to the findbugs exclude - the logic is complicated enough that it's confusing fb (please double check me on this as I'm turning off the safety)
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12534090/ZOOKEEPER-1427.patch
        against trunk revision 1355541.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 8 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1116//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1116//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1116//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12534090/ZOOKEEPER-1427.patch against trunk revision 1355541. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 8 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1116//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1116//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1116//console This message is automatically generated.
        Hide
        Flavio Junqueira added a comment -

        Great catch, Todd. For the patch Pat is proposing, it might be a good idea to make sure that the server doesn't make further progress if the write disk fail (exception is thrown in writeLongToFile). As was pointed out in this jira, if the server can't fulfill its promise, it shouldn't proceed. You may also want to raise the level of the log message to error in the catch block.

        Show
        Flavio Junqueira added a comment - Great catch, Todd. For the patch Pat is proposing, it might be a good idea to make sure that the server doesn't make further progress if the write disk fail (exception is thrown in writeLongToFile). As was pointed out in this jira, if the server can't fulfill its promise, it shouldn't proceed. You may also want to raise the level of the log message to error in the catch block.
        Hide
        Rakesh R added a comment -

        Its great. Thanks Todd, Pat

        I also come across with the same situation and currentEpoch is coming as empty.

        2012-06-28 14:23:13,307 [myid:2] - ERROR [main:QuorumPeer@453] - Unable to load database on disk
        java.io.IOException: Found null in ../dataDir/version-2/currentEpoch
        	at org.apache.zookeeper.server.quorum.QuorumPeer.readLongFromFile(QuorumPeer.java:1080)
        

        After analysing found the reason for empty file is due to 'no disk space'.

        The patch is really good and proven logic from hdfs. But I'm having one small clarification. Is there any case in zookeeper to look at the '.tmp' file while reading back?

        Say, (Current/accepted)epoch is 3 and the new epoch 4 has updated into '.tmp' file. Unfortunately after updating the '.tmp' file, zk got killed. During next startup whether zk server needs to look at the '.tmp' file as it is having the highest number than the actual epoch?

        -Rakesh

        Show
        Rakesh R added a comment - Its great. Thanks Todd, Pat I also come across with the same situation and currentEpoch is coming as empty. 2012-06-28 14:23:13,307 [myid:2] - ERROR [main:QuorumPeer@453] - Unable to load database on disk java.io.IOException: Found null in ../dataDir/version-2/currentEpoch at org.apache.zookeeper.server.quorum.QuorumPeer.readLongFromFile(QuorumPeer.java:1080) After analysing found the reason for empty file is due to 'no disk space'. The patch is really good and proven logic from hdfs. But I'm having one small clarification. Is there any case in zookeeper to look at the '.tmp' file while reading back? Say, (Current/accepted)epoch is 3 and the new epoch 4 has updated into '.tmp' file. Unfortunately after updating the '.tmp' file, zk got killed. During next startup whether zk server needs to look at the '.tmp' file as it is having the highest number than the actual epoch? -Rakesh
        Hide
        Todd Lipcon added a comment -

        The patch looks good to me. Thanks for the new tests, I'll try to pull them back into HDFS soon.

        Say, (Current/accepted)epoch is 3 and the new epoch 4 has updated into '.tmp' file. Unfortunately after updating the '.tmp' file, zk got killed. During next startup whether zk server needs to look at the '.tmp' file as it is having the highest number than the actual epoch?

        I think it's fine – if it is killed before renaming tmp to the final location, then it also will not have responded to any other process with the "promise". So it's OK to forget the promise – ie it wasn't successfully made.

        Show
        Todd Lipcon added a comment - The patch looks good to me. Thanks for the new tests, I'll try to pull them back into HDFS soon. Say, (Current/accepted)epoch is 3 and the new epoch 4 has updated into '.tmp' file. Unfortunately after updating the '.tmp' file, zk got killed. During next startup whether zk server needs to look at the '.tmp' file as it is having the highest number than the actual epoch? I think it's fine – if it is killed before renaming tmp to the final location, then it also will not have responded to any other process with the "promise". So it's OK to forget the promise – ie it wasn't successfully made.
        Hide
        Flavio Junqueira added a comment -

        Say, (Current/accepted)epoch is 3 and the new epoch 4 has updated into '.tmp' file. Unfortunately after updating the '.tmp' file, zk got killed. During next startup whether zk server needs to look at the '.tmp' file as it is having the highest number than the actual epoch?

        Servers are not required to complete the recovery phase of the protocol, they can crash before completing. There is a correctness issue only if a server goes back to an epoch that it promised not to go back to. It is ok to accept an epoch and not participating in it.

        Show
        Flavio Junqueira added a comment - Say, (Current/accepted)epoch is 3 and the new epoch 4 has updated into '.tmp' file. Unfortunately after updating the '.tmp' file, zk got killed. During next startup whether zk server needs to look at the '.tmp' file as it is having the highest number than the actual epoch? Servers are not required to complete the recovery phase of the protocol, they can crash before completing. There is a correctness issue only if a server goes back to an epoch that it promised not to go back to. It is ok to accept an epoch and not participating in it.
        Hide
        Patrick Hunt added a comment -

        Thanks for the feedback Rakesh/Todd/Flavio.

        it might be a good idea to make sure that the server doesn't make further progress if the write disk fail

        Using Eclipse I backtracked the calls to the atomic operation. It branches quickly, but from what I can see any exceptions thrown when writing the file are handled properly.

        Is there any case in zookeeper to look at the '.tmp' file while reading back?

        I think the comments clarified this. But basically the server never tries to read the .tmp file, it will only write that file and eventually do the rename. If the rename occurs we're done, if the rename fails we attempt to remove the .tmp file, regardless we ignore that file outside of the atomicoutputstream named file creation process. (so I think we're good)

        You may also want to raise the level of the log message to error in the catch block.

        I'll submit an updated patch for this.

        Show
        Patrick Hunt added a comment - Thanks for the feedback Rakesh/Todd/Flavio. it might be a good idea to make sure that the server doesn't make further progress if the write disk fail Using Eclipse I backtracked the calls to the atomic operation. It branches quickly, but from what I can see any exceptions thrown when writing the file are handled properly. Is there any case in zookeeper to look at the '.tmp' file while reading back? I think the comments clarified this. But basically the server never tries to read the .tmp file, it will only write that file and eventually do the rename. If the rename occurs we're done, if the rename fails we attempt to remove the .tmp file, regardless we ignore that file outside of the atomicoutputstream named file creation process. (so I think we're good) You may also want to raise the level of the log message to error in the catch block. I'll submit an updated patch for this.
        Hide
        Patrick Hunt added a comment -

        The new patches are the same as the old, only the logging has been moved from warn to error as noted in previous comment.

        Show
        Patrick Hunt added a comment - The new patches are the same as the old, only the logging has been moved from warn to error as noted in previous comment.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12535016/ZOOKEEPER-1427.patch
        against trunk revision 1355657.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 8 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1122//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1122//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1122//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12535016/ZOOKEEPER-1427.patch against trunk revision 1355657. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 8 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1122//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1122//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1122//console This message is automatically generated.
        Hide
        Marshall McMullen added a comment -

        It seems this only provide atomic write of acceptedEpoch and currentEpoch files. What about the other local files like the log and snapshot files? Seems like they should be written atomically as well.

        Show
        Marshall McMullen added a comment - It seems this only provide atomic write of acceptedEpoch and currentEpoch files. What about the other local files like the log and snapshot files? Seems like they should be written atomically as well.
        Hide
        Patrick Hunt added a comment -

        Hi Marshal. That's not the case as both logs and snapshot files are handled in such a way that this mechanism is not needed (and in the case of logs not possible/wanted).

        Show
        Patrick Hunt added a comment - Hi Marshal. That's not the case as both logs and snapshot files are handled in such a way that this mechanism is not needed (and in the case of logs not possible/wanted).
        Hide
        Flavio Junqueira added a comment -

        Just to add a few more comments of top of Pat's. We assume that a server logs to disk before it acks, so if an entry is written partially, then we can safely discard it. We use crcs to check the integrity of entries.

        The same doesn't hold for the epochs. If the files storing them get corrupted, then we might violate a promise of the server that's critical to guarantee the correctness of the replication protocol.

        Show
        Flavio Junqueira added a comment - Just to add a few more comments of top of Pat's. We assume that a server logs to disk before it acks, so if an entry is written partially, then we can safely discard it. We use crcs to check the integrity of entries. The same doesn't hold for the epochs. If the files storing them get corrupted, then we might violate a promise of the server that's critical to guarantee the correctness of the replication protocol.
        Hide
        Patrick Hunt added a comment -

        Flavio you bring up a good point. There is no checksum (etc...) on the epoch files themselves. That's really unfortunate/shortsighted. We need to do a better job on this in future (when adding new file/types).

        Seems we need to add a new jira for this. In particular I notice that while we are catching the numberformatexception we're not logging it. Might be useful to do that in the read method directly for debugging purposes. Also someone should verify all the call paths are handling the thrown IOException (NFE is converted to IOE). See ZOOKEEPER-1507

        Show
        Patrick Hunt added a comment - Flavio you bring up a good point. There is no checksum (etc...) on the epoch files themselves. That's really unfortunate/shortsighted. We need to do a better job on this in future (when adding new file/types). Seems we need to add a new jira for this. In particular I notice that while we are catching the numberformatexception we're not logging it. Might be useful to do that in the read method directly for debugging purposes. Also someone should verify all the call paths are handling the thrown IOException (NFE is converted to IOE). See ZOOKEEPER-1507
        Hide
        Camille Fournier added a comment -

        This patch looks ok to me, +1. Pat do you want to check it in or should I?

        Show
        Camille Fournier added a comment - This patch looks ok to me, +1. Pat do you want to check it in or should I?
        Hide
        Benjamin Reed added a comment -

        +1 great work. one thing that would be nice, but not needed for correctness and for a later jira, is to use a shutdown hook to cleanup the tmp files.

        Show
        Benjamin Reed added a comment - +1 great work. one thing that would be nice, but not needed for correctness and for a later jira, is to use a shutdown hook to cleanup the tmp files.
        Hide
        Patrick Hunt added a comment -

        Committed. Thanks Patrick!

        Show
        Patrick Hunt added a comment - Committed. Thanks Patrick!
        Hide
        Hudson added a comment -

        Integrated in ZooKeeper-trunk #1617 (See https://builds.apache.org/job/ZooKeeper-trunk/1617/)
        ZOOKEEPER-1427. Writing to local files is done non-atomically (phunt) (Revision 1362656)

        Result = SUCCESS
        phunt : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1362656
        Files :

        • /zookeeper/trunk/CHANGES.txt
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/common/AtomicFileOutputStream.java
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/common/IOUtils.java
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java
        • /zookeeper/trunk/src/java/test/config/findbugsExcludeFile.xml
        • /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/AtomicFileOutputStreamTest.java
        • /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/ClientBase.java
        Show
        Hudson added a comment - Integrated in ZooKeeper-trunk #1617 (See https://builds.apache.org/job/ZooKeeper-trunk/1617/ ) ZOOKEEPER-1427 . Writing to local files is done non-atomically (phunt) (Revision 1362656) Result = SUCCESS phunt : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1362656 Files : /zookeeper/trunk/CHANGES.txt /zookeeper/trunk/src/java/main/org/apache/zookeeper/common/AtomicFileOutputStream.java /zookeeper/trunk/src/java/main/org/apache/zookeeper/common/IOUtils.java /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java /zookeeper/trunk/src/java/test/config/findbugsExcludeFile.xml /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/AtomicFileOutputStreamTest.java /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/ClientBase.java

          People

          • Assignee:
            Patrick Hunt
            Reporter:
            Todd Lipcon
          • Votes:
            0 Vote for this issue
            Watchers:
            10 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development