ZooKeeper › ZOOKEEPER-1453

corrupted logs may not be correctly identified by FileTxnIterator

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 3.3.3
    • Fix Version/s: None
    • Component/s: server
    • Labels: None

      Description

      See ZOOKEEPER-1449 for background on this issue. The main problem is that during server recovery org.apache.zookeeper.server.persistence.FileTxnLog.FileTxnIterator.next() does not indicate whether the available logs are valid. In some cases (say, a truncated record and a single txnlog in the datadir) we will not detect that the file is corrupt, as opposed to having reached the end of the file.
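      The ambiguity is easy to demonstrate with plain java.io streams: a truncated record and a genuine end-of-log both surface as EOFException, so a reader that treats EOFException as "done" silently accepts corruption. A minimal, self-contained sketch (the record layout here — crc, length, body — is a simplification, not the real FileTxnLog format):

```java
import java.io.*;

public class TruncationDemo {
    // Serialize one record: 8-byte crc placeholder, 4-byte length, payload.
    static byte[] record(byte[] payload) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeLong(0L);            // crc (placeholder)
            out.writeInt(payload.length); // txn length
            out.write(payload);           // txn body
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns "EOF" whether the stream is empty (clean end of log) or the
    // record was cut short (corruption) -- the two are indistinguishable.
    static String readOne(byte[] log) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(log))) {
            in.readLong();                     // crc
            byte[] body = new byte[in.readInt()];
            in.readFully(body);                // throws EOFException if torn
            return "OK";
        } catch (EOFException e) {
            return "EOF";
        } catch (IOException e) {
            return "ERROR";
        }
    }

    public static void main(String[] args) {
        byte[] full = record(new byte[]{1, 2, 3, 4});
        byte[] truncated = java.util.Arrays.copyOf(full, full.length - 2);
        System.out.println(readOne(full));        // OK
        System.out.println(readOne(new byte[0])); // EOF (clean end of log)
        System.out.println(readOne(truncated));   // EOF (corruption, same signal)
    }
}
```

      Because both the clean and the corrupt case come back as "EOF", the caller needs extra information (such as a checksum check, discussed below in the comments) to tell them apart.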

      Attachments

      1. 10.10.5.42.tar.gz
        1.09 MB
        Marshall McMullen
      2. 10.10.5.44.tar.gz
        1.71 MB
        Marshall McMullen
      3. 10.10.5.123.tar.gz
        4.16 MB
        Marshall McMullen
      4. 10.10.5.42-withPath1489.tar.gz
        871 kB
        Marshall McMullen
      5. 10.10.5.44-withPath1489.tar.gz
        1.50 MB
        Marshall McMullen
      6. 10.10.5.123-withPath1489.tar.gz
        4.51 MB
        Marshall McMullen

        Activity

        Marshall McMullen added a comment -

        We've hit this bug a few times now and are wondering if there are any thoughts around this or what the fix for this would entail. Perhaps adding an md5 file for each snapshot/log file. Then on startup it can discard any that are corrupted... Thoughts?

        Patrick Hunt added a comment -

        I don't think we need to do that, we just need to handle these cases better. We already have a checksum on the record but we're not handling all the cases.

        Marshall McMullen added a comment -

        Oh, I didn't realize we already had a checksum on the record, that's helpful. Can you elaborate a little more on what cases we're not handling properly? I'd definitely like to help work on this but don't have any idea where to start looking.

        Patrick Hunt added a comment -

        See this comment, it's a very clear (and easy to reproduce) use case:
        https://issues.apache.org/jira/browse/ZOOKEEPER-1449?focusedCommentId=13261071&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13261071

        Basically the crc is after the record, we don't have protection against issues prior to verifying the xsum (it's more complicated than that though - see the linked comment).

        Marshall McMullen added a comment - - edited

        Patrick, thanks for taking the time to explain. I read through the other bug and your explanation is very clear. I'd like to work on a fix for this, as it's hitting us very frequently in a stress test we run where we continually reboot one of the machines hosting one of our zk servers. Anyhow, I'm looking at the FileTxnIterator code, and I definitely see the bug in the next() method: it always assumes EOF is success. Have you given thought to the right solution here? Maybe giving precedence to validating the CRC before checking for EOF?

        What do you think about this:

        public boolean next() throws IOException {
            if (ia == null) {
                return false;
            }
            try {
                long crcValue = ia.readLong("crcvalue");
                byte[] bytes = Util.readTxnBytes(ia);
                // validate CRC
                Checksum crc = makeChecksumAlgorithm();
                if (bytes != null) {
                    crc.update(bytes, 0, bytes.length);
                }
                if (crcValue != crc.getValue())
                    throw new IOException(CRC_ERROR);
                if (bytes == null || bytes.length == 0)
                    throw new EOFException("Failed to read " + logFile);
                hdr = new TxnHeader();
                record = SerializeUtils.deserializeTxn(bytes, hdr);
            } catch (EOFException e) {
            ...
        
        Patrick Hunt added a comment -

        Marshall are you sure you have this fix? https://issues.apache.org/jira/browse/ZOOKEEPER-1156

        I think it would be good to fix this issue; however, I'm more concerned about the fact that you're seeing corruption in your stress test environment. I think we need to track down the cause of that; my initial reaction is that you are seeing some bug that we need to fix. The ability to handle a corrupted log file is secondary to actually fixing the issue causing the corruption in the first place. Any insights on what might be causing the issue you are seeing? Perhaps you could file a bug report with log4j logs and a copy of your datadir?

        To answer your question, I think we need to capture any failure to verify the crc as a failure, rather than EOF (as you suggest).
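        The suggestion above — surface a checksum mismatch as corruption rather than EOF — can be sketched as a small classifier. This is illustrative, not the actual patch; Adler32 is used here on the assumption that it matches what FileTxnLog.makeChecksumAlgorithm() returns, and the three-way result is a made-up convention for the sketch:

```java
import java.util.zip.Adler32;
import java.util.zip.Checksum;

public class CrcGuard {
    // Classify a record after reading its stored checksum and payload:
    // an empty payload is a clean end of log, a checksum mismatch is
    // corruption, and only a matching checksum counts as a good record.
    static String classify(long storedCrc, byte[] payload) {
        if (payload == null || payload.length == 0) {
            return "EOF";      // nothing left to read: clean end of log
        }
        Checksum crc = new Adler32();
        crc.update(payload, 0, payload.length);
        return crc.getValue() == storedCrc ? "OK" : "CORRUPT";
    }

    public static void main(String[] args) {
        byte[] payload = {1, 2, 3, 4};
        Checksum crc = new Adler32();
        crc.update(payload, 0, payload.length);
        long good = crc.getValue();
        System.out.println(classify(good, payload));      // OK
        System.out.println(classify(good + 1, payload));  // CORRUPT
        System.out.println(classify(good, new byte[0]));  // EOF
    }
}
```

        The point is that "CORRUPT" and "EOF" are distinct outcomes, so recovery code can stop and complain on the former instead of treating every short read as a normal end of log.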

        Marshall McMullen added a comment -

        I'm running trunk revision 1339307, and it looks like ZOOKEEPER-1156 is included therein. So I think that I'm OK with regard to that particular bug.

        You bring up a very valid point. I hadn't thought about pinpointing why we are seeing this corruption; I'd only noticed that we are, and was trying to keep it from preventing the node from re-joining the ensemble when it comes back up.

        The next time this happens I'll grab log files and a copy of the datadir.

        Thanks!

        Flavio Junqueira added a comment -

        Hi Marshall, One quick question here. When you say you're rebooting servers in your setting, I was wondering how you're doing that exactly.

        Marshall McMullen added a comment -

        Flavio,

        This is on Linux servers, and we're trying to simulate non-graceful node failures, so we're calling "reboot -f". Since this doesn't go through shutdown, it doesn't give zookeeper a chance to shut down gracefully. What I suspect is happening is that if zookeeper happened to be in the middle of writing its logs or snapshots out to disk, the file would get truncated or suffer some other filesystem corruption. When the machine comes back up we restart zookeeper and it never rejoins the ensemble.
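        The torn-write scenario described here can be reproduced deterministically without power-cycling a box: append a record, then chop the file mid-record the way a crash can leave it. A hypothetical sketch (the temp-file name and the length-prefixed record layout are made up for illustration, not ZooKeeper's format):

```java
import java.io.*;

public class TornWrite {
    // Append one length-prefixed record, then cut the file short to
    // emulate a crash landing in the middle of the write.
    static File tornLog() throws IOException {
        File f = File.createTempFile("txnlog", ".bin");
        f.deleteOnExit();
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(f))) {
            out.writeInt(8);             // record length
            out.writeLong(0xCAFEBABEL);  // record body
        }
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            raf.setLength(raf.length() - 3); // lose the last 3 bytes
        }
        return f;
    }

    // Replay the log; the torn record surfaces as an EOFException.
    static String replay(File f) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(f))) {
            byte[] body = new byte[in.readInt()];
            in.readFully(body);
            return "OK";
        } catch (EOFException e) {
            return "TORN";
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(replay(tornLog())); // TORN
    }
}
```

        A harness like this makes the recovery path testable without the full reboot stress test.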

        Marshall McMullen added a comment -

        I was able to reproduce this problem again. After I power cycled the server a few times, the node in question refuses to join the ensemble and no clients can connect to it. When I try to telnet to the host in question and issue 'stat' it fails with:

        This ZooKeeper instance is not currently serving requests

        I enabled tracing, and the log file shows it failing during startup with:

        2012-06-24 20:34:31,734 [myid:1] - INFO [main:FileSnap@83][] - Reading snapshot /sf/data/zookeeper/10.10.5.123/version-2/snapshot.0
        2012-06-24 20:34:31,738 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@575][] - Created new input stream /sf/data/zookeeper/10.10.5.123/version-2/log.100000001
        2012-06-24 20:34:31,738 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@578][] - Created new input archive /sf/data/zookeeper/10.10.5.123/version-2/log.100000001
        2012-06-24 20:34:31,763 [myid:1] - DEBUG [main:DataTree@951][] - Ignoring processTxn failure hdr: -1 : error: -110
        2012-06-24 20:34:31,763 [myid:1] - DEBUG [main:FileTxnSnapLog@241][] - Ignoring processTxn failure hdr: -1 : error: -110
        2012-06-24 20:34:31,763 [myid:1] - DEBUG [main:DataTree@951][] - Ignoring processTxn failure hdr: -1 : error: -110

        ...[ repeats many many times ]...

        2012-06-24 20:34:32,065 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@618][] - EOF excepton java.io.EOFException: Failed to read /sf/data/zookeeper/10.10.5.123/version-2/log.100000001
        2012-06-24 20:34:32,067 [myid:1] - INFO [NIOServerCxn.Factory:/10.10.5.123:2181:NIOServerCnxnFactory@227][] - Accepted socket connection from /10.10.5.123:39623
        2012-06-24 20:34:32,069 [myid:1] - INFO [QuorumPeerListener:QuorumCnxManager$Listener@530][] - My election bind port: /10.10.5.123:2183
        2012-06-24 20:34:32,071 [myid:1] - WARN [NIOServerCxn.Factory:/10.10.5.123:2181:NIOServerCnxn@354][] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
        2012-06-24 20:34:32,071 [myid:1] - DEBUG [NIOServerCxn.Factory:/10.10.5.123:2181:NIOServerCnxn@358][] - IOException stack trace

        I also have a copy of the data directory if it would help.

        Marshall McMullen added a comment -

        Data directory of the node that won't rejoin zookeeper ensemble.

        Marshall McMullen added a comment -

        By the way, when you untar the attached zookeeper.tar.gz file, inside is the full zookeeper_trace.log file.

        Marshall McMullen added a comment -

        I'm seeing this again, same scenario as described before. Anyone have a chance to look at my log files to see what might be wrong here?

        Also, I'm wondering if the new code for https://issues.apache.org/jira/browse/ZOOKEEPER-1427 could be helpful here. I noticed the code in 1427 makes the writes to acceptedEpoch and currentEpoch files atomic, but what about the log and snapshot files? Are those similarly written atomically? Maybe that is causing my corruption issues....?

        Could really use some help here.

        Marshall McMullen added a comment -

        Patrick Hunt commented on 1427 that atomic writes of the snapshot and log files are not necessary.

        Anyhow, could still use some advice on what might be wrong here.

        Flavio Junqueira added a comment -

        I don't think the trace file is here; this is the output of ls after untarring it:

        drwxr-xr-x  0 root   root        0 Jun 25 00:54 zookeeper/
        -rw-r-----  0 root   root      247 Jun 25 01:12 zookeeper/10.10.5.123_2181.cfg
        -rw-r-----  0 root   root      176 Jun 25 01:12 zookeeper/10.10.5.123_2181.cfg.dynamic
        -rw-r-----  0 root   root        1 Jun 25 01:12 zookeeper/myid
        drwxr-xr-x  0 root   root        0 Jun 25 00:54 zookeeper/version-2/
        -rw-r--r--  0 root   root        1 Jun 25 00:54 zookeeper/version-2/currentEpoch
        -rw-r--r--  0 root   root        1 Jun 25 00:54 zookeeper/version-2/acceptedEpoch
        -rw-r--r--  0 root   root      669 Jun 25 00:54 zookeeper/version-2/snapshot.0
        -rw-r--r--  0 root   root 67108880 Jun 25 01:09 zookeeper/version-2/log.100000001
        

        I was able to read the transaction log ok, though.

        Marshall McMullen added a comment -

        OK, reproduced this again.... This time I made sure to capture a tar file of the data directory on each node and there is now a zookeeper_trace.log file in each tar file. In this scenario, I had a 3-node ensemble

        {10.10.5.42, 10.10.5.44, 10.10.5.123}

        and I rebooted 10.10.5.42 via issuing "reboot -f" on that node. When the node comes back up, it spews the following log message as before. Around 16:35 it is rebooted, and comes back around 16:38.

        2012-07-06 16:35:51,512 [myid:3] - DEBUG [CommitProcessor:3:FinalRequestProcessor@158][] - sessionid:0x3385e6145f2000b type:setData cxid:0x4ff77654 zxid:0x100000d4f txntype:5 reqpath:n/a

        ...[REBOOTED]...

        2012-07-06 16:38:38,026 [myid:] - INFO [main:QuorumPeerConfig@99][] - Reading configuration from: /sf/data/zookeeper/10.10.5.42/10.10.5.42_2181.cfg
        2012-07-06 16:38:38,038 [myid:2] - INFO [main:DatadirCleanupManager@78][] - autopurge.snapRetainCount set to 5
        2012-07-06 16:38:38,038 [myid:2] - INFO [main:DatadirCleanupManager@79][] - autopurge.purgeInterval set to 1
        2012-07-06 16:38:38,039 [myid:2] - INFO [PurgeTask:DatadirCleanupManager$PurgeTask@138][] - Purge task started.
        2012-07-06 16:38:38,045 [myid:2] - DEBUG [main:AbstractDynamicMBean@148][] - preRegister called. Server=com.sun.jmx.mbeanserver.JmxMBeanServer@49b979c0, name=log4j:hiearchy=default
        2012-07-06 16:38:38,045 [myid:2] - DEBUG [main:HierarchyDynamicMBean@260][] - postRegister is called.
        2012-07-06 16:38:38,046 [myid:2] - DEBUG [main:AbstractDynamicMBean@148][] - preRegister called. Server=com.sun.jmx.mbeanserver.JmxMBeanServer@49b979c0, name=log4j:logger=root
        2012-07-06 16:38:38,047 [myid:2] - DEBUG [main:LoggerDynamicMBean@258][] - Adding AppenderMBean for appender named CONSOLE
        2012-07-06 16:38:38,048 [myid:2] - DEBUG [main:AppenderDynamicMBean@158][] - getMBeanInfo called.
        2012-07-06 16:38:38,048 [myid:2] - DEBUG [main:AppenderDynamicMBean@337][] - preRegister called. Server=com.sun.jmx.mbeanserver.JmxMBeanServer@49b979c0, name=log4j:appender=CONSOLE
        2012-07-06 16:38:38,048 [myid:2] - DEBUG [main:AppenderDynamicMBean@197][] - Adding LayoutMBean:CONSOLE,layout=org.apache.log4j.PatternLayout
        2012-07-06 16:38:38,049 [myid:2] - DEBUG [main:LayoutDynamicMBean@140][] - getMBeanInfo called.
        2012-07-06 16:38:38,049 [myid:2] - DEBUG [main:AbstractDynamicMBean@148][] - preRegister called. Server=com.sun.jmx.mbeanserver.JmxMBeanServer@49b979c0, name=log4j:appender=CONSOLE,layout=org.apache.log4j.PatternLayout
        2012-07-06 16:38:38,049 [myid:2] - DEBUG [main:LoggerDynamicMBean@258][] - Adding AppenderMBean for appender named TRACEFILE
        2012-07-06 16:38:38,050 [myid:2] - DEBUG [main:AppenderDynamicMBean@158][] - getMBeanInfo called.
        2012-07-06 16:38:38,050 [myid:2] - DEBUG [main:AppenderDynamicMBean@337][] - preRegister called. Server=com.sun.jmx.mbeanserver.JmxMBeanServer@49b979c0, name=log4j:appender=TRACEFILE
        2012-07-06 16:38:38,051 [myid:2] - DEBUG [main:AppenderDynamicMBean@197][] - Adding LayoutMBean:TRACEFILE,layout=org.apache.log4j.PatternLayout
        2012-07-06 16:38:38,052 [myid:2] - DEBUG [main:LayoutDynamicMBean@140][] - getMBeanInfo called.
        2012-07-06 16:38:38,052 [myid:2] - DEBUG [main:AbstractDynamicMBean@148][] - preRegister called. Server=com.sun.jmx.mbeanserver.JmxMBeanServer@49b979c0, name=log4j:appender=TRACEFILE,layout=org.apache.log4j.PatternLayout
        2012-07-06 16:38:38,052 [myid:2] - INFO [PurgeTask:DatadirCleanupManager$PurgeTask@144][] - Purge task completed.
        2012-07-06 16:38:38,052 [myid:2] - DEBUG [main:HierarchyDynamicMBean@128][] - ---Adding logger [root] as listener.
        2012-07-06 16:38:38,052 [myid:2] - DEBUG [main:AbstractDynamicMBean@148][] - preRegister called. Server=com.sun.jmx.mbeanserver.JmxMBeanServer@49b979c0, name=log4j:logger=org.apache.zookeeper.server.persistence.FileSnap
        2012-07-06 16:38:38,053 [myid:2] - DEBUG [main:HierarchyDynamicMBean@128][] - ---Adding logger [org.apache.zookeeper.server.persistence.FileSnap] as listener.
        2012-07-06 16:38:38,053 [myid:2] - DEBUG [main:AbstractDynamicMBean@148][] - preRegister called. Server=com.sun.jmx.mbeanserver.JmxMBeanServer@49b979c0, name=log4j:logger=org.apache.log4j.jmx.LoggerDynamicMBean
        2012-07-06 16:38:38,053 [myid:2] - DEBUG [main:HierarchyDynamicMBean@128][] - ---Adding logger [org.apache.log4j.jmx.LoggerDynamicMBean] as listener.
        2012-07-06 16:38:38,053 [myid:2] - DEBUG [main:AbstractDynamicMBean@148][] - preRegister called. Server=com.sun.jmx.mbeanserver.JmxMBeanServer@49b979c0, name=log4j:logger=org.apache.zookeeper.jmx.MBeanRegistry
        2012-07-06 16:38:38,053 [myid:2] - DEBUG [main:HierarchyDynamicMBean@128][] - ---Adding logger [org.apache.zookeeper.jmx.MBeanRegistry] as listener.
        2012-07-06 16:38:38,054 [myid:2] - DEBUG [main:AbstractDynamicMBean@148][] - preRegister called. Server=com.sun.jmx.mbeanserver.JmxMBeanServer@49b979c0, name=log4j:logger=org.apache.zookeeper.server.persistence.FileTxnSnapLog
        2012-07-06 16:38:38,054 [myid:2] - DEBUG [main:HierarchyDynamicMBean@128][] - ---Adding logger [org.apache.zookeeper.server.persistence.FileTxnSnapLog] as listener.
        2012-07-06 16:38:38,054 [myid:2] - DEBUG [main:AbstractDynamicMBean@148][] - preRegister called. Server=com.sun.jmx.mbeanserver.JmxMBeanServer@49b979c0, name=log4j:logger=org.apache.zookeeper.server.quorum.QuorumPeerConfig
        2012-07-06 16:38:38,054 [myid:2] - DEBUG [main:HierarchyDynamicMBean@128][] - ---Adding logger [org.apache.zookeeper.server.quorum.QuorumPeerConfig] as listener.
        2012-07-06 16:38:38,054 [myid:2] - DEBUG [main:AbstractDynamicMBean@148][] - preRegister called. Server=com.sun.jmx.mbeanserver.JmxMBeanServer@49b979c0, name=log4j:logger=org.apache.log4j.jmx.AppenderDynamicMBean
        2012-07-06 16:38:38,054 [myid:2] - DEBUG [main:HierarchyDynamicMBean@128][] - ---Adding logger [org.apache.log4j.jmx.AppenderDynamicMBean] as listener.
        2012-07-06 16:38:38,054 [myid:2] - DEBUG [main:AbstractDynamicMBean@148][] - preRegister called. Server=com.sun.jmx.mbeanserver.JmxMBeanServer@49b979c0, name=log4j:logger=org.apache.zookeeper.server.persistence.Util
        2012-07-06 16:38:38,055 [myid:2] - DEBUG [main:HierarchyDynamicMBean@128][] - ---Adding logger [org.apache.zookeeper.server.persistence.Util] as listener.
        2012-07-06 16:38:38,055 [myid:2] - DEBUG [main:AbstractDynamicMBean@148][] - preRegister called. Server=com.sun.jmx.mbeanserver.JmxMBeanServer@49b979c0, name=log4j:logger=org.apache.zookeeper.server.quorum.QuorumPeerMain
        2012-07-06 16:38:38,055 [myid:2] - DEBUG [main:HierarchyDynamicMBean@128][] - ---Adding logger [org.apache.zookeeper.server.quorum.QuorumPeerMain] as listener.
        2012-07-06 16:38:38,055 [myid:2] - DEBUG [main:AbstractDynamicMBean@148][] - preRegister called. Server=com.sun.jmx.mbeanserver.JmxMBeanServer@49b979c0, name=log4j:logger=org.apache.log4j.jmx.HierarchyDynamicMBean
        2012-07-06 16:38:38,055 [myid:2] - DEBUG [main:HierarchyDynamicMBean@128][] - ---Adding logger [org.apache.log4j.jmx.HierarchyDynamicMBean] as listener.
        2012-07-06 16:38:38,055 [myid:2] - DEBUG [main:AbstractDynamicMBean@148][] - preRegister called. Server=com.sun.jmx.mbeanserver.JmxMBeanServer@49b979c0, name=log4j:logger=org.apache.zookeeper.server.DatadirCleanupManager
        2012-07-06 16:38:38,055 [myid:2] - DEBUG [main:HierarchyDynamicMBean@128][] - ---Adding logger [org.apache.zookeeper.server.DatadirCleanupManager] as listener.
        2012-07-06 16:38:38,056 [myid:2] - DEBUG [main:AbstractDynamicMBean@148][] - preRegister called. Server=com.sun.jmx.mbeanserver.JmxMBeanServer@49b979c0, name=log4j:logger=org.apache.log4j.jmx.LayoutDynamicMBean
        2012-07-06 16:38:38,056 [myid:2] - DEBUG [main:HierarchyDynamicMBean@128][] - ---Adding logger [org.apache.log4j.jmx.LayoutDynamicMBean] as listener.
        2012-07-06 16:38:38,056 [myid:2] - DEBUG [main:AbstractDynamicMBean@148][] - preRegister called. Server=com.sun.jmx.mbeanserver.JmxMBeanServer@49b979c0, name=log4j:logger=org.apache.zookeeper.server.persistence.FileTxnLog
        2012-07-06 16:38:38,056 [myid:2] - DEBUG [main:HierarchyDynamicMBean@128][] - ---Adding logger [org.apache.zookeeper.server.persistence.FileTxnLog] as listener.
        2012-07-06 16:38:38,056 [myid:2] - INFO [main:QuorumPeerMain@131][] - Starting quorum peer
        2012-07-06 16:38:38,072 [myid:2] - INFO [main:NIOServerCnxnFactory@108][] - binding to port /10.10.5.42:2181
        2012-07-06 16:38:38,080 [myid:2] - INFO [main:QuorumPeer@1107][] - tickTime set to 2000
        2012-07-06 16:38:38,080 [myid:2] - INFO [main:QuorumPeer@1127][] - minSessionTimeout set to -1
        2012-07-06 16:38:38,080 [myid:2] - INFO [main:QuorumPeer@1138][] - maxSessionTimeout set to -1
        2012-07-06 16:38:38,081 [myid:2] - INFO [main:QuorumPeer@1153][] - initLimit set to 10
        2012-07-06 16:38:38,093 [myid:2] - INFO [main:FileSnap@83][] - Reading snapshot /sf/data/zookeeper/10.10.5.42/version-2/snapshot.0
        2012-07-06 16:38:38,097 [myid:2] - DEBUG [main:FileTxnLog$FileTxnIterator@575][] - Created new input stream /sf/data/zookeeper/10.10.5.42/version-2/log.100000001
        2012-07-06 16:38:38,097 [myid:2] - DEBUG [main:FileTxnLog$FileTxnIterator@578][] - Created new input archive /sf/data/zookeeper/10.10.5.42/version-2/log.100000001
        2012-07-06 16:38:38,124 [myid:2] - DEBUG [main:DataTree@951][] - Ignoring processTxn failure hdr: -1 : error: -110
        2012-07-06 16:38:38,124 [myid:2] - DEBUG [main:FileTxnSnapLog@241][] - Ignoring processTxn failure hdr: -1 : error: -110
        2012-07-06 16:38:38,124 [myid:2] - DEBUG [main:DataTree@951][] - Ignoring processTxn failure hdr: -1 : error: -110
        2012-07-06 16:38:38,124 [myid:2] - DEBUG [main:FileTxnSnapLog@241][] - Ignoring processTxn failure hdr: -1 : error: -110
        2012-07-06 16:38:38,126 [myid:2] - DEBUG [main:DataTree@951][] - Ignoring processTxn failure hdr: -1 : error: -110
        2012-07-06 16:38:38,126 [myid:2] - DEBUG [main:FileTxnSnapLog@241][] - Ignoring processTxn failure hdr: -1 : error: -110

        ... [ REPEATS FOR A LONG TIME ] ...

        2012-07-06 16:38:38,432 [myid:2] - DEBUG [main:FileTxnLog$FileTxnIterator@618][] - EOF excepton java.io.EOFException: Failed to read /sf/data/zookeeper/10.10.5.42/version-2/log.100000001
        2012-07-06 16:38:38,437 [myid:2] - INFO [QuorumPeerListener:QuorumCnxManager$Listener@530][] - My election bind port: /10.10.5.42:2183
        2012-07-06 16:38:38,443 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.10.5.42:2181:QuorumPeer@825][] - Starting quorum peer
        2012-07-06 16:38:38,447 [myid:2] - INFO [QuorumPeer[myid=2]/10.10.5.42:2181:QuorumPeer@860][] - LOOKING
        2012-07-06 16:38:38,447 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.10.5.42:2181:QuorumPeer@789][] - Initializing leader election protocol...
        2012-07-06 16:38:38,448 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@732][] - Updating proposal: 2 (newleader), 0x100000d54 (newzxid), -1 (oldleader), 0xffffffffffffffff (oldzxid)
        2012-07-06 16:38:38,449 [myid:2] - INFO [QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@831][] - New election. My id = 2, proposed zxid=0x100000d54
        2012-07-06 16:38:38,449 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@583][] - Sending Notification: 2 (n.leader), 0x100000d54 (n.zxid), 0x1 (n.round), 1 (recipient), 2 (myid), 0x1 (n.peerEpoch)
        2012-07-06 16:38:38,450 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@583][] - Sending Notification: 2 (n.leader), 0x100000d54 (n.zxid), 0x1 (n.round), 2 (recipient), 2 (myid), 0x1 (n.peerEpoch)
        2012-07-06 16:38:38,450 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@583][] - Sending Notification: 2 (n.leader), 0x100000d54 (n.zxid), 0x1 (n.round), 3 (recipient), 2 (myid), 0x1 (n.peerEpoch)
        2012-07-06 16:38:38,450 [myid:2] - DEBUG [WorkerSender[myid=2]:QuorumCnxManager@375][] - Opening channel to server 1
        2012-07-06 16:38:38,453 [myid:2] - DEBUG [WorkerSender[myid=2]:QuorumCnxManager@381][] - Connected to server 1
        2012-07-06 16:38:38,454 [myid:2] - DEBUG [WorkerSender[myid=2]:QuorumCnxManager$SendWorker@616][] - Address of remote peer: 1
        2012-07-06 16:38:38,456 [myid:2] - DEBUG [WorkerSender[myid=2]:QuorumCnxManager@375][] - Opening channel to server 3
        2012-07-06 16:38:38,456 [myid:2] - DEBUG [WorkerSender[myid=2]:QuorumCnxManager@381][] - Connected to server 3
        2012-07-06 16:38:38,456 [myid:2] - INFO [WorkerSender[myid=2]:QuorumCnxManager@191][] - Have smaller server identifier, so dropping the connection: (3, 2)
        2012-07-06 16:38:38,457 [myid:2] - DEBUG [WorkerReceiver[myid=2]:FastLeaderElection$Messenger$WorkerReceiver@295][] - Receive new notification message. My id = 2
        2012-07-06 16:38:38,457 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@635][] - Notification: 2 (n.leader), 0x100000d54 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)0 (n.config version)
        2012-07-06 16:38:38,458 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@652][] - id: 2, proposed id: 2, zxid: 0x100000d54, proposed zxid: 0x100000d54
        2012-07-06 16:38:38,458 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@901][] - Adding vote: from=2, proposed leader=2, proposed zxid=0x100000d54, proposed election epoch=0x1
        2012-07-06 16:38:38,659 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.10.5.42:2181:QuorumCnxManager@444][] - Queue size: 0
        2012-07-06 16:38:38,659 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@583][] - Sending Notification: 2 (n.leader), 0x100000d54 (n.zxid), 0x1 (n.round), 1 (recipient), 2 (myid), 0x1 (n.peerEpoch)
        2012-07-06 16:38:38,659 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@583][] - Sending Notification: 2 (n.leader), 0x100000d54 (n.zxid), 0x1 (n.round), 2 (recipient), 2 (myid), 0x1 (n.peerEpoch)
        2012-07-06 16:38:38,659 [myid:2] - DEBUG [WorkerSender[myid=2]:QuorumCnxManager@408][] - There is a connection already for server 1
        2012-07-06 16:38:38,659 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@583][] - Sending Notification: 2 (n.leader), 0x100000d54 (n.zxid), 0x1 (n.round), 3 (recipient), 2 (myid), 0x1 (n.peerEpoch)
        2012-07-06 16:38:38,659 [myid:2] - INFO [QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@865][] - Notification time out: 400
        2012-07-06 16:38:38,660 [myid:2] - DEBUG [WorkerReceiver[myid=2]:FastLeaderElection$Messenger$WorkerReceiver@295][] - Receive new notification message. My id = 2
        2012-07-06 16:38:38,660 [myid:2] - DEBUG [WorkerSender[myid=2]:QuorumCnxManager@375][] - Opening channel to server 3
        2012-07-06 16:38:38,660 [myid:2] - DEBUG [WorkerSender[myid=2]:QuorumCnxManager@381][] - Connected to server 3
        2012-07-06 16:38:38,661 [myid:2] - INFO [WorkerSender[myid=2]:QuorumCnxManager@191][] - Have smaller server identifier, so dropping the connection: (3, 2)
        2012-07-06 16:38:38,661 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@635][] - Notification: 2 (n.leader), 0x100000d54 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)0 (n.config version)
        2012-07-06 16:38:38,661 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@652][] - id: 2, proposed id: 2, zxid: 0x100000d54, proposed zxid: 0x100000d54
        2012-07-06 16:38:38,661 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@901][] - Adding vote: from=2, proposed leader=2, proposed zxid=0x100000d54, proposed election epoch=0x1
        2012-07-06 16:38:39,061 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.10.5.42:2181:QuorumCnxManager@444][] - Queue size: 0
        2012-07-06 16:38:39,062 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@583][] - Sending Notification: 2 (n.leader), 0x100000d54 (n.zxid), 0x1 (n.round), 1 (recipient), 2 (myid), 0x1 (n.peerEpoch)
        2012-07-06 16:38:39,062 [myid:2] - DEBUG [WorkerSender[myid=2]:QuorumCnxManager@408][] - There is a connection already for server 1
        2012-07-06 16:38:39,062 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@583][] - Sending Notification: 2 (n.leader), 0x100000d54 (n.zxid), 0x1 (n.round), 2 (recipient), 2 (myid), 0x1 (n.peerEpoch)
        2012-07-06 16:38:39,062 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@583][] - Sending Notification: 2 (n.leader), 0x100000d54 (n.zxid), 0x1 (n.round), 3 (recipient), 2 (myid), 0x1 (n.peerEpoch)
        2012-07-06 16:38:39,062 [myid:2] - INFO [QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@865][] - Notification time out: 800
        2012-07-06 16:38:39,062 [myid:2] - DEBUG [WorkerReceiver[myid=2]:FastLeaderElection$Messenger$WorkerReceiver@295][] - Receive new notification message. My id = 2
        2012-07-06 16:38:39,062 [myid:2] - DEBUG [WorkerSender[myid=2]:QuorumCnxManager@375][] - Opening channel to server 3
        2012-07-06 16:38:39,063 [myid:2] - DEBUG [WorkerSender[myid=2]:QuorumCnxManager@381][] - Connected to server 3
        2012-07-06 16:38:39,063 [myid:2] - INFO [WorkerSender[myid=2]:QuorumCnxManager@191][] - Have smaller server identifier, so dropping the connection: (3, 2)
        2012-07-06 16:38:39,063 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@635][] - Notification: 2 (n.leader), 0x100000d54 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)0 (n.config version)
        2012-07-06 16:38:39,064 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@652][] - id: 2, proposed id: 2, zxid: 0x100000d54, proposed zxid: 0x100000d54
        2012-07-06 16:38:39,064 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@901][] - Adding vote: from=2, proposed leader=2, proposed zxid=0x100000d54, proposed election epoch=0x1
        2012-07-06 16:38:39,864 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.10.5.42:2181:QuorumCnxManager@444][] - Queue size: 0
        2012-07-06 16:38:39,864 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@583][] - Sending Notification: 2 (n.leader), 0x100000d54 (n.zxid), 0x1 (n.round), 1 (recipient), 2 (myid), 0x1 (n.peerEpoch)
        2012-07-06 16:38:39,864 [myid:2] - DEBUG [WorkerSender[myid=2]:QuorumCnxManager@408][] - There is a connection already for server 1
        2012-07-06 16:38:39,864 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@583][] - Sending Notification: 2 (n.leader), 0x100000d54 (n.zxid), 0x1 (n.round), 2 (recipient), 2 (myid), 0x1 (n.peerEpoch)
        2012-07-06 16:38:39,864 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@583][] - Sending Notification: 2 (n.leader), 0x100000d54 (n.zxid), 0x1 (n.round), 3 (recipient), 2 (myid), 0x1 (n.peerEpoch)
        2012-07-06 16:38:39,864 [myid:2] - INFO [QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@865][] - Notification time out: 1600
        2012-07-06 16:38:39,865 [myid:2] - DEBUG [WorkerReceiver[myid=2]:FastLeaderElection$Messenger$WorkerReceiver@295][] - Receive new notification message. My id = 2
        2012-07-06 16:38:39,865 [myid:2] - DEBUG [WorkerSender[myid=2]:QuorumCnxManager@375][] - Opening channel to server 3
        2012-07-06 16:38:39,865 [myid:2] - DEBUG [WorkerSender[myid=2]:QuorumCnxManager@381][] - Connected to server 3
        2012-07-06 16:38:39,865 [myid:2] - INFO [WorkerSender[myid=2]:QuorumCnxManager@191][] - Have smaller server identifier, so dropping the connection: (3, 2)
        2012-07-06 16:38:39,866 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@635][] - Notification: 2 (n.leader), 0x100000d54 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)0 (n.config version)
        2012-07-06 16:38:39,866 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@652][] - id: 2, proposed id: 2, zxid: 0x100000d54, proposed zxid: 0x100000d54
        2012-07-06 16:38:39,866 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@901][] - Adding vote: from=2, proposed leader=2, proposed zxid=0x100000d54, proposed election epoch=0x1
        2012-07-06 16:38:41,259 [myid:2] - INFO [NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxnFactory@227][] - Accepted socket connection from /10.10.5.44:50944
        2012-07-06 16:38:41,263 [myid:2] - WARN [NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxn@354][] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
        2012-07-06 16:38:41,264 [myid:2] - DEBUG [NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxn@358][] - IOException stack trace
        java.io.IOException: ZooKeeperServer not running
        at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:926)
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:229)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:238)
        at java.lang.Thread.run(Thread.java:722)

        Thereafter I see "Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running" over and over. See full trace files on all three nodes in the attachments.

        I really need help on this, so anything anyone can offer will be immensely appreciated.

        Patrick Hunt added a comment -

        Any chance you could apply the fix for ZOOKEEPER-1489 and try reproducing?

        Marshall McMullen added a comment -

        Patrick, yes, I will try that right now and get back to you.

        Patrick Hunt added a comment -

        Sweet, thx.

        Marshall McMullen added a comment -

        Testing this with the fix from ZOOKEEPER-1489 right now. Usually takes about 30 minutes for our test to fail. Will report back with results when that finishes.

        Bill Bridge added a comment -

        The ZooKeeper log records are not sector aligned on disk. When a server crashes there can be disk I/O in flight, each write covering an integral number of sectors. Some of those writes will complete and some will not, and the completion order is not guaranteed to match the order in the file. There is a reasonable chance that the last log record will be partial, and there may be unwritten sectors in the file followed by blocks that do contain records.

        The code needs to recognize that partial records at the end of a log are a possibility and treat them as if they were never written. One hazard with doing that is that a corruption in the middle of a log might be mistaken for an EOF. One sanity check would be to include in every log record the highest record id known to be persistent. After finding the apparent end of the log, the code could scan ahead for a valid record and declare the log corrupt if that record implies the log was previously committed beyond the apparent EOF.
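        The scan-ahead idea above can be sketched in a self-contained way. Note this is a hypothetical record layout (`[int length][payload][Adler32 checksum]`) invented purely for illustration, not ZooKeeper's actual on-disk format; the point is only the decision rule: after the first record that fails to parse, probe later offsets for another checksummed record, and finding one implies mid-file corruption rather than a crash-truncated tail.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.Adler32;

// Illustrative format only: [int len][payload][long adler32]. NOT ZooKeeper's
// real txnlog layout; it just demonstrates the scan-ahead heuristic.
public class LogTailScanner {

    static long checksum(byte[] payload) {
        Adler32 a = new Adler32();
        a.update(payload, 0, payload.length);
        return a.getValue();
    }

    static void append(ByteArrayOutputStream out, byte[] payload) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(4 + payload.length + 8);
        buf.putInt(payload.length).put(payload).putLong(checksum(payload));
        out.write(buf.array());
    }

    /**
     * Returns "EOF" when the log parses cleanly or ends in a partial record
     * (a crash-truncated tail), and "CORRUPT" when a bad record is followed
     * by data that still parses as a valid checksummed record.
     */
    static String scan(byte[] log) {
        ByteBuffer buf = ByteBuffer.wrap(log);
        int failStart = -1;
        while (buf.remaining() >= 12) {
            int start = buf.position();
            int len = buf.getInt();
            if (len < 0 || len > buf.remaining() - 8) { failStart = start; break; }
            byte[] payload = new byte[len];
            buf.get(payload);
            if (buf.getLong() != checksum(payload)) { failStart = start; break; }
        }
        if (failStart < 0 && buf.remaining() == 0) return "EOF"; // clean parse
        if (failStart < 0) failStart = buf.position();           // tiny partial tail
        // Scan ahead: a later offset yielding a checksummed record means the
        // failure was mid-file corruption, not a truncated tail.
        for (int off = failStart + 1; off + 12 <= log.length; off++) {
            ByteBuffer probe = ByteBuffer.wrap(log, off, log.length - off);
            int len = probe.getInt();
            if (len <= 0 || len > probe.remaining() - 8) continue;
            byte[] payload = new byte[len];
            probe.get(payload);
            if (probe.getLong() == checksum(payload)) return "CORRUPT";
        }
        return "EOF";
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        append(out, "txn-1".getBytes());
        append(out, "txn-2".getBytes());
        byte[] full = out.toByteArray();

        // Crash-truncated tail: drop the last 4 bytes of the final record.
        System.out.println(scan(Arrays.copyOf(full, full.length - 4))); // EOF

        // Mid-file corruption: flip one payload byte of the FIRST record,
        // leaving a valid record after it.
        byte[] corrupt = full.clone();
        corrupt[5] ^= 0x7f;
        System.out.println(scan(corrupt)); // CORRUPT
    }
}
```

        A false "CORRUPT" is still possible if random bytes after the tail happen to checksum, which is why Bill's stronger check (embedding the highest persisted record id in each record) is attractive: it gives positive evidence of commits beyond the apparent EOF.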

        Marshall McMullen added a comment -

        Unfortunately it failed again. I've uploaded new tar files with "-withPath1489" in the name (that was meant to be Patch not Path but anyhow...). Looks to be the same exact symptom. The rebooted node is never able to rejoin the ensemble. Any ideas or thoughts are greatly appreciated.

        If this is a different bug than what's here in ZOOKEEPER-1453, then I can open a separate jira... just let me know.

        Flavio Junqueira added a comment -

        I was wondering the same thing as Bill when I asked how you're doing this experiment. One way to verify is to disable the disk write cache and see if the problem goes away.

        Marshall McMullen added a comment -

        I disabled write cache on the drive that holds my zookeeper database, and it still fails in exactly the same way :-<.

        Here's the part that really baffles me, I tried removing the on-disk database entirely (the version-2 directory) and starting up zookeeper again on the thought that it would just pull down a fresh copy of the database from one of its peers. Unfortunately it still fails to connect. See the output below:

        root@SF-42:/sf/data# java -cp /opt/zookeeper-3.5.0-p7/zookeeper-3.5.0-p7.jar:/opt/zookeeper-3.5.0-p7/lib/log4j-1.2.16.jar:/opt/zookeeper-3.5.0-p7/lib/commons-cli-1.2.jar:/opt/zookeeper-3.5.0-p7/lib/slf4j-log4j12-1.6.2.jar:/opt/zookeeper-3.5.0-p7/lib/netty-3.2.5.Final.jar:/opt/zookeeper-3.5.0-p7/lib/jline-0.9.94.jar:/opt/zookeeper-3.5.0-p7/lib/javacc.jar:/opt/zookeeper-3.5.0-p7/lib/slf4j-api-1.6.2.jar:/opt/zookeeper-3.5.0-p7/conf -Dzookeeper.root.logger=DEBUG,CONSOLE -Dzookeeper.log.dir=. -Dzookeeper.tracelog.dir=/sf/data/zookeeper/10.10.5.42/ -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false -Djute.maxbuffer=4194304 org.apache.zookeeper.server.quorum.QuorumPeerMain /sf/data/zookeeper/10.10.5.42/10.10.5.42_2181.cfg
        2012-07-07 10:20:23,270 [myid:] - INFO [main:QuorumPeerConfig@99] - Reading configuration from: /sf/data/zookeeper/10.10.5.42/10.10.5.42_2181.cfg
        2012-07-07 10:20:23,279 [myid:2] - INFO [main:DatadirCleanupManager@78] - autopurge.snapRetainCount set to 5
        2012-07-07 10:20:23,279 [myid:2] - INFO [main:DatadirCleanupManager@79] - autopurge.purgeInterval set to 1
        2012-07-07 10:20:23,280 [myid:2] - INFO [PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task started.
        2012-07-07 10:20:23,289 [myid:2] - INFO [PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task completed.
        2012-07-07 10:20:23,290 [myid:2] - INFO [main:QuorumPeerMain@131] - Starting quorum peer
        2012-07-07 10:20:23,300 [myid:2] - INFO [main:NIOServerCnxnFactory@108] - binding to port /10.10.5.42:2181
        2012-07-07 10:20:23,308 [myid:2] - INFO [main:QuorumPeer@1107] - tickTime set to 2000
        2012-07-07 10:20:23,308 [myid:2] - INFO [main:QuorumPeer@1127] - minSessionTimeout set to -1
        2012-07-07 10:20:23,308 [myid:2] - INFO [main:QuorumPeer@1138] - maxSessionTimeout set to -1
        2012-07-07 10:20:23,308 [myid:2] - INFO [main:QuorumPeer@1153] - initLimit set to 10
        2012-07-07 10:20:23,321 [myid:2] - INFO [main:QuorumPeer@620] - currentEpoch not found! Creating with a reasonable default of 0. This should only happen when you are upgrading your installation
        2012-07-07 10:20:23,322 [myid:2] - INFO [main:QuorumPeer@635] - acceptedEpoch not found! Creating with a reasonable default of 0. This should only happen when you are upgrading your installation
        2012-07-07 10:20:23,325 [myid:2] - INFO [QuorumPeerListener:QuorumCnxManager$Listener@530] - My election bind port: /10.10.5.42:2183
        2012-07-07 10:20:23,333 [myid:2] - INFO [QuorumPeer[myid=2]/10.10.5.42:2181:QuorumPeer@860] - LOOKING
        2012-07-07 10:20:23,334 [myid:2] - INFO [QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@831] - New election. My id = 2, proposed zxid=0x0
        2012-07-07 10:20:23,341 [myid:2] - INFO [NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxnFactory@227] - Accepted socket connection from /10.10.5.44:48534
        2012-07-07 10:20:23,342 [myid:2] - INFO [WorkerSender[myid=2]:QuorumCnxManager@191] - Have smaller server identifier, so dropping the connection: (3, 2)
        2012-07-07 10:20:23,342 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@635] - Notification: 2 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config version)
        2012-07-07 10:20:23,345 [myid:2] - WARN [NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
        2012-07-07 10:20:23,346 [myid:2] - INFO [NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxn@1002] - Closed socket connection for client /10.10.5.44:48534 (no session established for client)
        2012-07-07 10:20:23,544 [myid:2] - INFO [QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@865] - Notification time out: 400
        2012-07-07 10:20:23,545 [myid:2] - INFO [WorkerSender[myid=2]:QuorumCnxManager@191] - Have smaller server identifier, so dropping the connection: (3, 2)
        2012-07-07 10:20:23,545 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@635] - Notification: 2 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config version)
        2012-07-07 10:20:23,680 [myid:2] - INFO [NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxnFactory@227] - Accepted socket connection from /10.10.5.44:48535
        2012-07-07 10:20:23,680 [myid:2] - WARN [NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
        2012-07-07 10:20:23,680 [myid:2] - INFO [NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxn@1002] - Closed socket connection for client /10.10.5.44:48535 (no session established for client)
        2012-07-07 10:20:23,946 [myid:2] - INFO [QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@865] - Notification time out: 800
        2012-07-07 10:20:23,946 [myid:2] - INFO [WorkerSender[myid=2]:QuorumCnxManager@191] - Have smaller server identifier, so dropping the connection: (3, 2)
        2012-07-07 10:20:23,947 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@635] - Notification: 2 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config version)
        2012-07-07 10:20:24,014 [myid:2] - INFO [NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxnFactory@227] - Accepted socket connection from /10.10.5.44:48536
        2012-07-07 10:20:24,014 [myid:2] - WARN [NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
        2012-07-07 10:20:24,015 [myid:2] - INFO [NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxn@1002] - Closed socket connection for client /10.10.5.44:48536 (no session established for client)
        2012-07-07 10:20:24,349 [myid:2] - INFO [NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxnFactory@227] - Accepted socket connection from /10.10.5.44:48650
        2012-07-07 10:20:24,349 [myid:2] - WARN [NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
        2012-07-07 10:20:24,349 [myid:2] - INFO [NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxn@1002] - Closed socket connection for client /10.10.5.44:48650 (no session established for client)
        2012-07-07 10:20:24,683 [myid:2] - INFO [NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxnFactory@227] - Accepted socket connection from /10.10.5.44:48678
        2012-07-07 10:20:24,683 [myid:2] - WARN [NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
        2012-07-07 10:20:24,683 [myid:2] - INFO [NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxn@1002] - Closed socket connection for client /10.10.5.44:48678 (no session established for client)
        2012-07-07 10:20:24,747 [myid:2] - INFO [QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@865] - Notification time out: 1600

        Marshall McMullen added a comment -

        Can anyone explain two things in the above that are really puzzling to me:

        (1) Why does it continually show "ZooKeeperServer not running"? I am certain the process is running. So this must not refer to the process itself but something else. Can someone explain?

        (2) What does this message signify:

        2012-07-07 10:20:23,342 [myid:2] - INFO [WorkerSender[myid=2]:QuorumCnxManager@191] - Have smaller server identifier, so dropping the connection: (3, 2)

        Marshall McMullen added a comment -

        The only way I am able to get this node to rejoin the ensemble is to stop zookeeper, remove its data directory entirely, then copy the entire version-2 directory from another peer by hand. Then when I start zk on that node everything works as expected.

        Bill Bridge added a comment -

        This is not just about the disk write cache. It is necessary to disable the disk write cache to ensure a commit is truly persistent, but that is not the only source of this problem.

        The problem is that the OS maintains a queue of disk writes. Some number of them have been submitted to the disk and some are still in the queue. These writes are not necessarily in file block order. When there is a reboot the write for the last data block may still be in the queue while the write to the block before it has gone to the disk.

        If the reboot does a hardware reset or is caused by power failure, a similar thing can happen for the writes that have been given to the disk. Disks like to have about 4 - 8 requests queued to them so that they can order them to reduce seek/rotational latency. If there is a reset from the disk controller or loss of power, the disk cannot complete the requests it was given. This can result in some completing and some not. Since the disk reorders based on location of the sectors and the current location of the disk arm and current location of spindle rotation, it is not possible to predict what order writes will complete in.

        Marshall McMullen added a comment -

        Well this seems like a pretty huge limitation if an ensemble node can't reliably survive a power failure on one of the nodes without data corruption.

        Surely someone has seen this before or has an idea how this can be fixed? Why can't we simply write to a *.tmp file and then move it over to the real file only after we know all the writes to the *.tmp file completed? POSIX guarantees that the rename operation is atomic (http://pubs.opengroup.org/onlinepubs/009695399/functions/rename.html).
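The write-then-rename pattern Marshall describes could look like the following hypothetical helper (this is not ZooKeeper's actual snapshot or log code). Note that it only fits whole-file artifacts such as snapshots; an append-only transaction log cannot be rewritten and renamed on every append:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: data is written and fsync'd to a temporary file,
// then atomically renamed into place, so a crash leaves either the old
// file or the complete new one - never a partially written target.
public class AtomicWrite {
    public static void write(Path target, byte[] data) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        try (FileChannel ch = FileChannel.open(tmp,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ch.write(ByteBuffer.wrap(data));
            ch.force(true); // flush data and metadata before the rename
        }
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

The `force(true)` before the rename matters: without it, the rename can become durable before the file contents do, and a crash could leave the target pointing at incomplete data.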

        Marshall McMullen added a comment -

        Two other observations about this massive roadblock that I've seen:

        (1) Far, far more likely to be a problem if it's the leader node that is rebooted

        (2) Once rebooted, even if I clear out the version-2 directory and restart zookeeper on that node it cannot rejoin the ensemble. If I restart one of the other zookeeper processes that is already in the ensemble, then those two nodes can now properly see one another and form quorum. The third node that was already in the ensemble is suddenly not connectable. If I restart zookeeper on that node, then it joins the new ensemble and everything is fine.

        This seems pretty catastrophic to me.

        Marshall McMullen added a comment -

        Is it possible that the issue I'm seeing isn't necessarily caused by corrupted logs, but some sort of failed leader election? It seems very suspect that if I "reboot -f" the leader node, it can't join the ensemble on reboot. But once I restart the other two zk processes, then everyone syncs up and we're OK again. Must be a failure in leader election...?

        Flavio Junqueira added a comment -

        We force it to disk, so if the disk write cache is disabled, then force returning successfully is an indication that it has been written to media. The observation about the OS buffers only holds if we don't sync to disk.

        I checked the files in the previous tar.gz without the traces, but I couldn't spot anything suspicious. I'll look into the last batch uploaded. Please don't delete old files.

        Flavio Junqueira added a comment -

        I have just checked the ids of the servers, both traces and myid file, and the id of the server 10.10.5.42 is wrong. From the traces, you can actually tell that it was 3 and later switched to 2 after the reboot. Having two servers with the same id seems to be causing the confusion you're observing here.

        Marshall McMullen added a comment -

        Oh my gosh, I can't believe it was a simple configuration issue!! Thank you so much for spotting the problem Flavio, you have seriously saved me!

        Sorry to have spammed this jira with noise that wasn't related. It sure looked like it at first.

        Bill Bridge added a comment -

        Glad to hear it was just a configuration issue. I hope fixing the configuration eliminates the corruption. If it does then the partial write problem is much less likely than your experiment implies. However it is a real world issue. If the OS crashes during a file sync, it is possible to have some blocks written and others not. The fix is to recognize partially written records and throw them away at recovery time. Oracle recovery has done this for many years.

        Marshall McMullen added a comment -

        Just wanted to confirm that this was indeed the problem. We use zookeeper in an embedded manner so our application code creates the myid file programmatically when we start zookeeper. We now know there is a bug in how it's doing that. I can't believe how much time was wasted on such a simple configuration problem. So thankful to Flavio and everyone else here for helping me sort that out. Given how fatal this was, it might have been useful if ZK could have detected multiple servers with the same ID and given a more helpful error message.

        In any event I agree with Bill that there is still a potential issue with corruption from a reboot that should be addressed under this jira.

        Flavio Junqueira added a comment -

        I'm glad it worked, Marshall.

        there is still a potential issue with corruption from a reboot that should be addressed under this jira.

        In my understanding, the partial writes during syncs that Bill is referring to would be captured by the crc checks. Is this not enough?

        Bill Bridge added a comment -

        The CRC checks certainly should catch a partially written record at the end of the log. However since this is an expected event we do not want to discard the log as corrupt. We want to ignore the partial record at the end of the log, and any records that might follow. If a CRC failure is always treated as EOF, then corruption that is not from a partial write during a crash will not be treated like corruption. Thus the CRC will no longer fulfill its current role as an assurance of no corruption.

        You can solve this problem by putting a block header on every sector of a log file. The block header includes a check value. Every log write is an integral number of blocks. The log is preformatted to contain valid blocks with an earlier log sequence number. Encountering valid blocks with a lower sequence number is EOF. Encountering a block with an invalid check value is a corruption. This is based on the assumption that a disk write always completes an integral number of sector writes. This is true, but in very rare circumstances writing a block can result in an I/O error on reading the block. This is infrequent enough that the multiple logs ZooKeeper uses should be sufficient protection.

        This is not a quick fix. I think there should be another jira to propose this. In the mean time it may be sufficient to decide a CRC failure is EOF if it is caused by the CRC value being zero. This is based on the fact that you clear ahead with zero to reduce the number of allocations. I am hoping the CRC is the very last value in a record and on disk it is aligned to its size so that a partial record ends in zero. A partial record that ends at the file EOF should also be ignored and not considered corruption.
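Bill's interim heuristic — reclassify a CRC failure as EOF only when it looks like the zero-filled preallocated region — might be sketched like this (hypothetical helper; it assumes the stored CRC and payload bytes are available at the point of failure, and that the log really is preallocated with zeros):

```java
// Hypothetical helper for the interim heuristic: a CRC mismatch is
// reclassified as a clean EOF only when both the stored CRC and the
// payload are all zeros, i.e. the reader has run into the preallocated,
// never-written tail of the log. Anything else is genuine corruption.
public class ZeroFillEof {
    public static boolean isZeroFillEof(long storedCrc, byte[] payload) {
        if (storedCrc != 0) {
            return false; // a real (nonzero) CRC that failed to match: corruption
        }
        for (byte b : payload) {
            if (b != 0) {
                return false; // nonzero bytes under a zero CRC: corruption
            }
        }
        return true; // all zeros: preallocated region, treat as EOF
    }
}
```

As Bill's last sentence points out, this check alone does not cover a partial record cut off at the physical end of the file, which also has to be ignored rather than treated as corruption.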

        Flavio Junqueira added a comment -

        Hi Bill, thanks a lot for your input. I really appreciate that you're taking the time to revisit this part of the code. Please see my comments below:

        If a CRC failure is always treated as EOF, then corruption that is not from a partial write during a crash will not be treated like corruption

        I think that if we are to assume corruptions like bit flips and such, we need to do much more than adding CRCs and block-aligned writes. I'm a bit concerned about a full redesign of the transaction log scheme to cover cases that the current fault model of ZooKeeper does not.

        The log is preformatted to contain valid blocks with an earlier log sequence number.

        I'm not sure I understand this step. How do we know the log sequence numbers beforehand?

        It may be sufficient to decide a CRC failure is EOF if it is caused by the CRC value being zero.

        It sounds right to me.

        Patrick Hunt added a comment -

        We already handle partial writes via the sentinel, IIRC.

        My understanding is that we should treat any corruption as EOF: these records were written but not sync'd, therefore we would not have ACK'd the proposal. Does this sound right? The server will rejoin the quorum, and if it does have a "gap" (say due to corruption) it will download the necessary transactions from the leader to catch up with the quorum.

        Bill Bridge added a comment -

        Yes, a full redesign of the logging system would be too much for this problem. Corruptions usually come from software bugs that misdirect data when writing, or from storage that accidentally fails to write (note that writing to the wrong place can look like a lost write at the correct place). Another common source is administrator mistakes, like copying data to the wrong file or simultaneously assigning the same file to two different uses. Bit flips within a sector do not happen with disks. I think the CRC is a reasonable check value for the kinds of corruption we are likely to encounter.

        Sorry, I omitted a critical point about preformatting logs. The same log file is used over and over again so that there is no allocation when writing. Before the first use, the log is initialized to contain valid empty blocks for log sequence zero. Since every reuse is at a higher sequence number, the EOF is the first block with a lower sequence number than the one recorded in the log header. This is how Oracle writes online logs (see: Oracle Online Redo Log).

        ZooKeeper is different: it uses a new file for every log. It incrementally preallocates with zeroes to batch the allocations, and the zeroes are not forced to disk. The real data writes usually overwrite the zeroes in the filesystem buffer cache, so the zeroes are not likely to be on disk if there is a partial write due to a crash. I suppose there are times when the fsync unnecessarily forces the zeroes to disk, and I guess the consequences of a crash during fsync are filesystem dependent. Maybe a 0x00 where the 0x42 sentinel should be at the end of a record indicates a partial record whose zeroes were flushed earlier, and EOF in the middle of a record means a partial write as well.
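        For illustration, the Oracle-style scheme Bill outlines might look like the sketch below. The layout and names are hypothetical (this is not ZooKeeper's current log format): every block is stamped with the sequence number of the log generation that wrote it, and the first block carrying a lower sequence number than the log header marks EOF.

```java
// Sketch of Oracle-style EOF detection in a reused, preformatted log
// (hypothetical layout, not ZooKeeper's current format). Each block
// carries the sequence number of the generation that wrote it; the
// first block with a lower sequence number than the header is EOF.
public class PreformattedLog {
    // blockSeqs[i] = sequence number stamped on block i.
    public static int findEof(long headerSeq, long[] blockSeqs) {
        for (int i = 0; i < blockSeqs.length; i++) {
            if (blockSeqs[i] < headerSeq) {
                return i; // stale block from an earlier use: end of log
            }
        }
        return blockSeqs.length; // log is completely full
    }

    public static void main(String[] args) {
        // Log reused at sequence 7; the last two blocks still hold
        // data from the previous use at sequence 6.
        long[] blocks = {7, 7, 7, 6, 6};
        System.out.println(findEof(7, blocks)); // 3
    }
}
```

        Because the file is never extended and every block is always valid for *some* sequence number, a crash mid-write can only leave a block with a bad check value, which is then unambiguously corruption rather than an expected tail.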

        Bill Bridge added a comment -

        I am not up to speed on ZooKeeper terminology, so I am guessing that the "sentinel" is the 0x42 (one of my favorite constants) at the end of each record. I see that in Util.readTxnBytes it is validated to be 'B' (aka 0x42), and the record is considered partial if it is wrong. This will certainly catch a partial record.

        Is there any way it could legitimately be anything other than 0x00 or 0x42? If not, then it would be marginally more robust to consider it partial only if it is 0x00, and corrupt if it is not 0x42. This could only happen if something outside of ZooKeeper damaged the log after it was written. The head DBA (database administrator) at Amazon once said to me, "Sometimes blocks just go bad." Amazon has a lot of blocks.
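        Bill's refinement to the sentinel check can be sketched as a three-way classification. The helper below is hypothetical (the real check lives in Util.readTxnBytes); the facts assumed from the thread are that ZooKeeper writes 0x42 ('B') after each record and that preallocated space is zero-filled.

```java
// Sketch of the refined sentinel check (hypothetical helper; the real
// validation is in Util.readTxnBytes). 0x42 = complete record,
// 0x00 = zero-filled tail (partial write, treat as EOF),
// anything else = damage after the record was written.
public class SentinelCheck {
    public enum Verdict { VALID, PARTIAL_EOF, CORRUPT }

    public static Verdict check(byte sentinel) {
        if (sentinel == 0x42) return Verdict.VALID;       // complete record
        if (sentinel == 0x00) return Verdict.PARTIAL_EOF; // preallocated tail
        return Verdict.CORRUPT; // log damaged after being written
    }

    public static void main(String[] args) {
        System.out.println(check((byte) 0x42)); // VALID
        System.out.println(check((byte) 0x00)); // PARTIAL_EOF
        System.out.println(check((byte) 0x13)); // CORRUPT
    }
}
```

        The point of the three-way split is exactly the one Bill makes: today any non-0x42 value is treated as a partial record, so external damage is silently swallowed as EOF instead of being surfaced as corruption.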

        Patrick Hunt added a comment -

        Hi Bill.

        I am guessing that the "sentinel iirc" is the 0x42

        that's correct.

        ZooKeeper is different, it uses a new file for every log.

        Ben was talking about changing this at one point; he wanted to reuse log files. It seemed like a good idea, but he never completed the patch.


          People

          • Assignee: Unassigned
          • Reporter: Patrick Hunt
          • Votes: 0
          • Watchers: 5
