ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-460

bad testRetry in cppunit tests (hudson failure)

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.2.1, 3.3.0
    • Component/s: c client, tests
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      the followng code failed on hudson
      http://hudson.zones.apache.org/hudson/view/ZooKeeper/job/ZooKeeper-trunk/371/

      watchctx_t ctx1, ctx2;
      zhandle_t *zk1 = createClient(&ctx1);
      CPPUNIT_ASSERT_EQUAL(true, ctx1.waitForConnected(zk1));
      zhandle_t *zk2 = createClient(&ctx2);
      zookeeper_close(zk1);
      CPPUNIT_ASSERT_EQUAL(true, ctx2.waitForConnected(zk2));

      there's a problem with this test, it assumes that close(1) can be called before createclient(2) gets connected.

      this is not correct: createclient is an async call an in some cases the connection can be established before
      create client returns.

      this shows a failure in this case because client1 was created, then client2 attempted to connect
      but failed due to this on the server (max conn exceeded):
      sprintf(cmd, "export ZKMAXCNXNS=1;%s startClean %s", ZKSERVER_CMD, getHostPorts());

      conn 2 failed and therefore the following assert eventually failed.

      this code should not assume that close(1) will beat connect(2)

      Henry can you take a look?

      1. zookeeper-460.patch
        0.5 kB
        Giridharan Kesavan

        Issue Links

          Activity

          Hide
          Patrick Hunt added a comment -

          the build has been failing for the past 6 days, this is very bad – in effect no CI

          Henry, can you look at this or should I?

          Show
          Patrick Hunt added a comment - the build has been failing for the past 6 days, this is very bad – in effect no CI Henry, can you look at this or should I?
          Hide
          Henry Robinson added a comment -

          I'll take a look.

          Show
          Henry Robinson added a comment - I'll take a look.
          Hide
          Henry Robinson added a comment -

          I need a little help getting to the bottom of this (I might be misreading Hudson's logs).

          The code in question is, I think, 'ok' (although a bit dodgy). The idea is to test the ability of a client - that is waiting because the max cnxns limit has been reached - to reconnect once a slot becomes free on the server. So ideally for this test close(1) should happen after createclient(2) has connected. As you say, this is a false assumption as the close might happen before the createClient(2) succeeds so there is no contention, but this should only be giving false positives - the second assert should eventually succeed. What I need to do to improve this is to replace createClient with a call that blocks until we at least know the connection attempt has been made, if that's possible.

          However the most recent Hudson failures don't seem to be related. From build 375:

          [exec] Zookeeper_simpleSystem::testAsyncWatcherAutoReset : assertion
          [exec] Zookeeper_watchers::testDefaultSessionWatcher1 : OK
          [exec] Zookeeper_watchers::testDefaultSessionWatcher2 : OK
          [exec] Zookeeper_watchers::testObjectSessionWatcher1 : OK
          [exec] Zookeeper_watchers::testObjectSessionWatcher2 : OK
          [exec] Zookeeper_watchers::testNodeWatcher1 : OK
          [exec] Zookeeper_watchers::testChildWatcher1 : OK
          [exec] Zookeeper_watchers::testChildWatcher2 : OK
          [exec]
          [exec] /home/hudson/hudson-slave/workspace/ZooKeeper-trunk/trunk/src/c/tests/TestClient.cc:289: Assertion: equality assertion failed [Expected: -101, Actual : -4]
          [exec] Failures !!!
          [exec] Run: 32 Failure total: 1 Failures: 1 Errors: 0
          [exec] make: *** [run-check] Error 1

          and the same from 376 (yesterday's build). These are failing in TestClient (specifically testAsyncWatcherAutoReset). The error here is that a stat completion callback is getting called with ZCONNECTIONLOSS, but is expecting to see ZNONODE, and the assert is failing.

          This test runs fine for me locally, so is the problem a heavily loaded Hudson, causing the connection loss?

          Similarly the failed build you point to, 371, fails TestClientRetry with a broken pipe error which to my novice eye sounds a bit like something falling over under load.

          It looks to me right now like the TestClientRetry code needs improving, but is benign as it should only cause false positives, and we need to understand the reasons why TestClient is failing. Does that sound right?

          Show
          Henry Robinson added a comment - I need a little help getting to the bottom of this (I might be misreading Hudson's logs). The code in question is, I think, 'ok' (although a bit dodgy). The idea is to test the ability of a client - that is waiting because the max cnxns limit has been reached - to reconnect once a slot becomes free on the server. So ideally for this test close(1) should happen after createclient(2) has connected. As you say, this is a false assumption as the close might happen before the createClient(2) succeeds so there is no contention, but this should only be giving false positives - the second assert should eventually succeed. What I need to do to improve this is to replace createClient with a call that blocks until we at least know the connection attempt has been made, if that's possible. However the most recent Hudson failures don't seem to be related. From build 375: [exec] Zookeeper_simpleSystem::testAsyncWatcherAutoReset : assertion [exec] Zookeeper_watchers::testDefaultSessionWatcher1 : OK [exec] Zookeeper_watchers::testDefaultSessionWatcher2 : OK [exec] Zookeeper_watchers::testObjectSessionWatcher1 : OK [exec] Zookeeper_watchers::testObjectSessionWatcher2 : OK [exec] Zookeeper_watchers::testNodeWatcher1 : OK [exec] Zookeeper_watchers::testChildWatcher1 : OK [exec] Zookeeper_watchers::testChildWatcher2 : OK [exec] [exec] /home/hudson/hudson-slave/workspace/ZooKeeper-trunk/trunk/src/c/tests/TestClient.cc:289: Assertion: equality assertion failed [Expected: -101, Actual : -4] [exec] Failures !!! [exec] Run: 32 Failure total: 1 Failures: 1 Errors: 0 [exec] make: *** [run-check] Error 1 and the same from 376 (yesterday's build). These are failing in TestClient (specifically testAsyncWatcherAutoReset). The error here is that a stat completion callback is getting called with ZCONNECTIONLOSS, but is expecting to see ZNONODE, and the assert is failing. This test runs fine for me locally, so is the problem a heavily loaded Hudson, causing the connection loss? Similarly the failed build you point to, 371, fails TestClientRetry with a broken pipe error which to my novice eye sounds a bit like something falling over under load. It looks to me right now like the TestClientRetry code needs improving, but is benign as it should only cause false positives, and we need to understand the reasons why TestClient is failing. Does that sound right?
          Hide
          Patrick Hunt added a comment -

          Henry, I'm also working on ZOOKEEPER-473 which will fix a number of the
          intermittent failures on hudson. however 473 is separate from this issue.
          Let's nail down these, then see what's left and address those in subsequent
          jiras. thanks.

          Show
          Patrick Hunt added a comment - Henry, I'm also working on ZOOKEEPER-473 which will fix a number of the intermittent failures on hudson. however 473 is separate from this issue. Let's nail down these, then see what's left and address those in subsequent jiras. thanks.
          Hide
          Patrick Hunt added a comment -

          Hm, I don't have access to run code on hudson, Mahadev does though, I'll check with him.

          Show
          Patrick Hunt added a comment - Hm, I don't have access to run code on hudson, Mahadev does though, I'll check with him.
          Hide
          Mahadev konar added a comment -

          i just ran on the tests on vesta manually and it passes on the machine. It seems to pass for me on the machine :

          I see this on bulild 371 -

           [exec] make: *** [run-check] Broken pipe
           [exec] Running Zookeeper_clientretry::testRetry
          

          this looks more like the c api generating SIGPIPE and the c tests crashing on that. All we will need to do is ignore the SIGPIPE. Let me check the others for what kind of error they have.

          Show
          Mahadev konar added a comment - i just ran on the tests on vesta manually and it passes on the machine. It seems to pass for me on the machine : I see this on bulild 371 - [exec] make: *** [run-check] Broken pipe [exec] Running Zookeeper_clientretry::testRetry this looks more like the c api generating SIGPIPE and the c tests crashing on that. All we will need to do is ignore the SIGPIPE. Let me check the others for what kind of error they have.
          Hide
          Mahadev konar added a comment -

          each time it seems we have a different reason for tests failing. This is bad.
          On 377 build
          http://hudson.zones.apache.org/hudson/view/ZooKeeper/job/ZooKeeper-trunk/377/

          the tests fail because of failure in FLENewEpochTest, other times its mostly the c tests.

          Show
          Mahadev konar added a comment - each time it seems we have a different reason for tests failing. This is bad. On 377 build http://hudson.zones.apache.org/hudson/view/ZooKeeper/job/ZooKeeper-trunk/377/ the tests fail because of failure in FLENewEpochTest, other times its mostly the c tests.
          Hide
          Patrick Hunt added a comment -

          ZOOKEEPER-473 should fix a lot of this, esp the *LE tests (socket reuse)

          Show
          Patrick Hunt added a comment - ZOOKEEPER-473 should fix a lot of this, esp the *LE tests (socket reuse)
          Hide
          Mahadev konar added a comment -

          the reason the tests are failing is because the servers are not able to start for cppunit tests. the following is the exception on servers run via ccppunit tests -

          CLOVER] FATAL ERROR: Clover could not be initialised. Are you sure you have Clover in the runtime classpath? (class java.lang.NoClassDefFoundError:com_cenqua_clover/CloverVersionInfo)
          Exception in thread "main" java.lang.NoClassDefFoundError: com_cenqua_clover/CoverageRecorder
              at org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:48)
          Caused by: java.lang.ClassNotFoundException: com_cenqua_clover.CoverageRecorder
              at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
              at java.security.AccessController.doPrivileged(Native Method)
              at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
              at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
              at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
              at java.lang.ClassLoader.loadClass(ClassLoader
          
          Show
          Mahadev konar added a comment - the reason the tests are failing is because the servers are not able to start for cppunit tests. the following is the exception on servers run via ccppunit tests - CLOVER] FATAL ERROR: Clover could not be initialised. Are you sure you have Clover in the runtime classpath? (class java.lang.NoClassDefFoundError:com_cenqua_clover/CloverVersionInfo) Exception in thread "main" java.lang.NoClassDefFoundError: com_cenqua_clover/CoverageRecorder at org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:48) Caused by: java.lang.ClassNotFoundException: com_cenqua_clover.CoverageRecorder at java.net.URLClassLoader$1.run(URLClassLoader.java:200) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:188) at java.lang. ClassLoader .loadClass( ClassLoader .java:307) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang. ClassLoader .loadClass( ClassLoader
          Hide
          Giridharan Kesavan added a comment -

          this should fix the clover classpath issue

          Show
          Giridharan Kesavan added a comment - this should fix the clover classpath issue
          Hide
          Patrick Hunt added a comment -

          this patch fixed the problem on hudson, thanks Giri!

          Show
          Patrick Hunt added a comment - this patch fixed the problem on hudson, thanks Giri!

            People

            • Assignee:
              Mahadev konar
              Reporter:
              Patrick Hunt
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development