ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-1333

NPE in FileTxnSnapLog when restarting a cluster

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.4.0
    • Fix Version/s: 3.4.2, 3.5.0
    • Component/s: server
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      I think a NPE was created in the fix for https://issues.apache.org/jira/browse/ZOOKEEPER-1269

      Looking in DataTree.processTxn(TxnHeader header, Record txn) it seems likely that if rc.err != Code.OK then rc.path will be null.

      I'm currently working on a minimal test case for the bug, I'll attach it to this issue when it's ready.

      java.lang.NullPointerException
      at org.apache.zookeeper.server.persistence.FileTxnSnapLog.processTransaction(FileTxnSnapLog.java:203)
      at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:150)
      at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
      at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:418)
      at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:410)
      at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:151)
      at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
      at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)

      1. ZOOKEEPER-1333_trunk.patch
        12 kB
        Patrick Hunt
      2. ZOOKEEPER-1333_a.patch
        12 kB
        Patrick Hunt
      3. ZOOKEEPER-1333.patch_a
        12 kB
        Mahadev konar
      4. ZOOKEEPER-1333.patch
        12 kB
        Patrick Hunt
      5. ZOOKEEPER-1333.patch
        12 kB
        Patrick Hunt
      6. test_case.diff
        3 kB
        Andrew McNair
      7. test_case.diff
        3 kB
        Andrew McNair

        Activity

        Hide
        Andrew McNair added a comment -

        Added a test case in LoadFromLogTest that fails:

        /**

        • Test we can restore a snapshot that has errors and data ahead of the zxid
        • of the snapshot file.
          */
          @Test
          public void testRestoreWithTransactionErrors() throws Exception {
          // setup a single server cluster
          File tmpDir = ClientBase.createTmpDir();
          ClientBase.setupTestEnv();
          ZooKeeperServer zks = new ZooKeeperServer(tmpDir, tmpDir, 3000);
          SyncRequestProcessor.setSnapCount(10000);
          final int PORT = Integer.parseInt(HOSTPORT.split(":")[1]);
          ServerCnxnFactory f = ServerCnxnFactory.createFactory(PORT, -1);
          f.startup(zks);
          Assert.assertTrue("waiting for server being up ", ClientBase
          .waitForServerUp(HOSTPORT, CONNECTION_TIMEOUT));
          ZooKeeper zk = new ZooKeeper(HOSTPORT, CONNECTION_TIMEOUT, this);

        long start = System.currentTimeMillis();
        while (!connected) {
        long end = System.currentTimeMillis();
        if (end - start > 5000)

        { Assert.assertTrue("Could not connect with server in 5 seconds", false); }

        try

        { Thread.sleep(200); }

        catch (Exception e)

        { LOG.warn("Intrrupted"); }

        }
        // generate some transactions
        try {
        for (int i = 0; i < NUM_MESSAGES; i++) {
        try

        { zk.create("/invaliddir/test-", new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL); }

        catch(NoNodeException e)

        { //Expected }

        }
        } finally

        { zk.close(); }

        // force the zxid to be behind the content
        zks.getZKDatabase().setlastProcessedZxid(
        zks.getZKDatabase().getDataTreeLastProcessedZxid() - 10);
        LOG.info("Set lastProcessedZxid to "
        + zks.getZKDatabase().getDataTreeLastProcessedZxid());

        // Force snapshot and restore
        zks.takeSnapshot();
        zks.shutdown();
        f.shutdown();

        zks = new ZooKeeperServer(tmpDir, tmpDir, 3000);
        SyncRequestProcessor.setSnapCount(10000);
        f = ServerCnxnFactory.createFactory(PORT, -1);
        f.startup(zks);
        Assert.assertTrue("waiting for server being up ", ClientBase
        .waitForServerUp(HOSTPORT, CONNECTION_TIMEOUT));

        f.shutdown();
        }

        Show
        Andrew McNair added a comment - Added a test case in LoadFromLogTest that fails: /** Test we can restore a snapshot that has errors and data ahead of the zxid of the snapshot file. */ @Test public void testRestoreWithTransactionErrors() throws Exception { // setup a single server cluster File tmpDir = ClientBase.createTmpDir(); ClientBase.setupTestEnv(); ZooKeeperServer zks = new ZooKeeperServer(tmpDir, tmpDir, 3000); SyncRequestProcessor.setSnapCount(10000); final int PORT = Integer.parseInt(HOSTPORT.split(":") [1] ); ServerCnxnFactory f = ServerCnxnFactory.createFactory(PORT, -1); f.startup(zks); Assert.assertTrue("waiting for server being up ", ClientBase .waitForServerUp(HOSTPORT, CONNECTION_TIMEOUT)); ZooKeeper zk = new ZooKeeper(HOSTPORT, CONNECTION_TIMEOUT, this); long start = System.currentTimeMillis(); while (!connected) { long end = System.currentTimeMillis(); if (end - start > 5000) { Assert.assertTrue("Could not connect with server in 5 seconds", false); } try { Thread.sleep(200); } catch (Exception e) { LOG.warn("Intrrupted"); } } // generate some transactions try { for (int i = 0; i < NUM_MESSAGES; i++) { try { zk.create("/invaliddir/test-", new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL); } catch(NoNodeException e) { //Expected } } } finally { zk.close(); } // force the zxid to be behind the content zks.getZKDatabase().setlastProcessedZxid( zks.getZKDatabase().getDataTreeLastProcessedZxid() - 10); LOG.info("Set lastProcessedZxid to " + zks.getZKDatabase().getDataTreeLastProcessedZxid()); // Force snapshot and restore zks.takeSnapshot(); zks.shutdown(); f.shutdown(); zks = new ZooKeeperServer(tmpDir, tmpDir, 3000); SyncRequestProcessor.setSnapCount(10000); f = ServerCnxnFactory.createFactory(PORT, -1); f.startup(zks); Assert.assertTrue("waiting for server being up ", ClientBase .waitForServerUp(HOSTPORT, CONNECTION_TIMEOUT)); f.shutdown(); }
        Hide
        Andrew McNair added a comment -

        Added file with failing test case

        Show
        Andrew McNair added a comment - Added file with failing test case
        Hide
        Patrick Hunt added a comment -

        Andrew can you re-attach with "grant license" checked? Otw we can't include your test case as part of the commit.

        Show
        Patrick Hunt added a comment - Andrew can you re-attach with "grant license" checked? Otw we can't include your test case as part of the commit.
        Hide
        Andrew McNair added a comment -

        Attaching failing test case with license

        Show
        Andrew McNair added a comment - Attaching failing test case with license
        Hide
        Patrick Hunt added a comment -

        This fixes the NPE, however the test is still failing. I don't think we should be erroring out on a NONODE, that just seems wrong on it's face (but that was the goal of the change introduced by ZOOKEEPER-1269 which is really odd)

        Show
        Patrick Hunt added a comment - This fixes the NPE, however the test is still failing. I don't think we should be erroring out on a NONODE, that just seems wrong on it's face (but that was the goal of the change introduced by ZOOKEEPER-1269 which is really odd)
        Hide
        Benjamin Reed added a comment -

        there are times that you don't error out on NONODE: when you are applying the transaction log to a fuzzy snapshot. you might have delete /foo in the transaction log and the snapshot already has the node deleted. you want to ignore the error when you apply delete /foo

        Show
        Benjamin Reed added a comment - there are times that you don't error out on NONODE: when you are applying the transaction log to a fuzzy snapshot. you might have delete /foo in the transaction log and the snapshot already has the node deleted. you want to ignore the error when you apply delete /foo
        Hide
        Benjamin Reed added a comment -

        i looked more closely and it looks like before 1269 the code that was throwing an exception was protected by a check to OpCode.create. we should probably put that back in and fix the resulting message a bit.

        Show
        Benjamin Reed added a comment - i looked more closely and it looks like before 1269 the code that was throwing an exception was protected by a check to OpCode.create. we should probably put that back in and fix the resulting message a bit.
        Hide
        Patrick Hunt added a comment -

        this patch fixes the problem as outlined in the comments.

        However I just noticed that the tests still pass when the "throw NoNodeException" is commented out. So this leads me to believe we are not testing this code path.

        I'm continuing to work on this - specifically added tests for this code path.

        Ben if you had an idea re the logging lmk what it should say.

        Show
        Patrick Hunt added a comment - this patch fixes the problem as outlined in the comments. However I just noticed that the tests still pass when the "throw NoNodeException" is commented out. So this leads me to believe we are not testing this code path. I'm continuing to work on this - specifically added tests for this code path. Ben if you had an idea re the logging lmk what it should say.
        Hide
        Benjamin Reed added a comment -

        +1 this looks good. i can't think of a better message for the log

        Show
        Benjamin Reed added a comment - +1 this looks good. i can't think of a better message for the log
        Hide
        Thomas Koch added a comment -

        This is the mail that triggered ZOOKEEPER-1269:
        http://mail-archives.apache.org/mod_mbox/zookeeper-dev/201110.mbox/%3C201110281704.24935.thomas@koch.ro%3E

        It describes three suspected failure scenarios. Only the text of the first one got was copied in 1269 and somehow addressed. I don't know, whether the other two are valid.

        Show
        Thomas Koch added a comment - This is the mail that triggered ZOOKEEPER-1269 : http://mail-archives.apache.org/mod_mbox/zookeeper-dev/201110.mbox/%3C201110281704.24935.thomas@koch.ro%3E It describes three suspected failure scenarios. Only the text of the first one got was copied in 1269 and somehow addressed. I don't know, whether the other two are valid.
        Hide
        Camille Fournier added a comment -

        The logic in FileTxnSnapLog changed quite a bit, and I'm not sure if the create check makes sense with the new logic or not. The create check logic was moved into DataTree, so what I made the check in FileTxnSnapLog for I'm not entirely sure.

        Show
        Camille Fournier added a comment - The logic in FileTxnSnapLog changed quite a bit, and I'm not sure if the create check makes sense with the new logic or not. The create check logic was moved into DataTree, so what I made the check in FileTxnSnapLog for I'm not entirely sure.
        Hide
        Camille Fournier added a comment -

        Ah ok. Yeah, so if we put the create check in, we won't get that nonode exception if the multi fails on that, would be the only potential issue with this fix that I can see.

        Show
        Camille Fournier added a comment - Ah ok. Yeah, so if we put the create check in, we won't get that nonode exception if the multi fails on that, would be the only potential issue with this fix that I can see.
        Hide
        Camille Fournier added a comment -

        But I'm pretty sure that nonode exception itself was kind of a crazy sanity check of the "we should never reach this" sort. To get there you would have to be creating a child node that already exists, but with a parent that doesn't exist. So it's no surprise that we don't have a test for that case.

        Show
        Camille Fournier added a comment - But I'm pretty sure that nonode exception itself was kind of a crazy sanity check of the "we should never reach this" sort. To get there you would have to be creating a child node that already exists, but with a parent that doesn't exist. So it's no surprise that we don't have a test for that case.
        Hide
        Mahadev konar added a comment -

        @Camille,
        Agreed. I think the patch as it stands is good to go. The only concern I have is the code is pretty convoluted in processtransaction. We should work on making it cleaner. Ill add some comments for now when committing.

        Show
        Mahadev konar added a comment - @Camille, Agreed. I think the patch as it stands is good to go. The only concern I have is the code is pretty convoluted in processtransaction. We should work on making it cleaner. Ill add some comments for now when committing.
        Hide
        Mahadev konar added a comment -

        Updated patch with some comments that I committed to 3.4 branch.

        Show
        Mahadev konar added a comment - Updated patch with some comments that I committed to 3.4 branch.
        Hide
        Patrick Hunt added a comment -

        Minor improvement to the 3.4 version of the patch (_a) to make the logging more clear.

        Show
        Patrick Hunt added a comment - Minor improvement to the 3.4 version of the patch (_a) to make the logging more clear.
        Hide
        Patrick Hunt added a comment -

        Merged into the trunk. Includes the logging improvement.

        Show
        Patrick Hunt added a comment - Merged into the trunk. Includes the logging improvement.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12508288/ZOOKEEPER-1333_trunk.patch
        against trunk revision 1214571.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 4 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/845//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/845//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/845//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508288/ZOOKEEPER-1333_trunk.patch against trunk revision 1214571. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 4 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/845//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/845//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/845//console This message is automatically generated.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12508288/ZOOKEEPER-1333_trunk.patch
        against trunk revision 1214571.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 4 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/846//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/846//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/846//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508288/ZOOKEEPER-1333_trunk.patch against trunk revision 1214571. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 4 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/846//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/846//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/846//console This message is automatically generated.
        Hide
        Mahadev konar added a comment -

        I just committed this to trunk and the branch. Thanks Pat and Andrew!

        Show
        Mahadev konar added a comment - I just committed this to trunk and the branch. Thanks Pat and Andrew!
        Hide
        Hudson added a comment -

        Integrated in ZooKeeper-trunk #1404 (See https://builds.apache.org/job/ZooKeeper-trunk/1404/)
        ZOOKEEPER-1333. NPE in FileTxnSnapLog when restarting a cluster. (Patrick Hunt via mahadev)

        mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1221868
        Files :

        • /zookeeper/trunk/CHANGES.txt
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/DataTree.java
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/persistence/FileTxnSnapLog.java
        • /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/LoadFromLogTest.java
        Show
        Hudson added a comment - Integrated in ZooKeeper-trunk #1404 (See https://builds.apache.org/job/ZooKeeper-trunk/1404/ ) ZOOKEEPER-1333 . NPE in FileTxnSnapLog when restarting a cluster. (Patrick Hunt via mahadev) mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1221868 Files : /zookeeper/trunk/CHANGES.txt /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/DataTree.java /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/persistence/FileTxnSnapLog.java /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/LoadFromLogTest.java

          People

          • Assignee:
            Patrick Hunt
            Reporter:
            Andrew McNair
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development