ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-1065

Possible timing issue in embedded server

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Invalid
    • Affects Version/s: 3.3.3
    • Fix Version/s: None
    • Component/s: java client, server
    • Labels:
      None
    • Environment:

      Windows 7, 32bit, Core2 Duo T9300, JDK 1.6.0_24, ZooKeeper data on 500GB hybrid Seagate HDD with 4GB SSD cache

      Description

      I have an application that uses ZooKeeper. There is an ensemble in
      production. But in order to simplify development the application will
      start an embedded ZooKeeper server when started in development mode. We
      are experiencing a timing issue with ZooKeeper 3.3.3 and I was wondering
      if this is allowed to be happen or if we did something wrong when
      starting the embedded server.

      Basically, we have a watch registered using an #exists call and watch
      code like the following.

      @Override
      public void process(final WatchedEvent event) {
        switch (event.getType()) {
          ...
          case NodeCreated:
            pathCreated(event.getPath());
            break;
          ...
        }
      }
      
      @Override
      protected void pathCreated(final String path) {
        // process events only for this node
        if (!isMyPath(path))
          return;
        try {
          loadNode(); // calls zk.getData(String, Watcher, Stat)
        } catch (final Exception e) {
          // got NoNodeException here (but not when debugging)
          log(..., e)
        }
      }
      

      From inspecting the logs we noticed a NoNodeException. When setting
      breakpoints on #loadNode and stepping through we don't get the
      exception. But when setting a breakpoint on #log only we got a hit and
      could confirm the issue this way.

      The path is actually some levels deep. All the parent paths don't exist
      either so they are created as well. However, no exception is thrown fro
      them. The sequence is as follows.

      /l1  --> watch triggered, getData, no exception
      /l1/l2  --> watch triggered, getData, no exception
      /l1/l2/l3  --> watch triggered, getData, no exception
      /l1/l2/l3/l4  --> watch triggered, getData, no exception
      /l1/l2/l3/l4/l5  --> watch triggered, getData, no exception
      /l1/l2/l3/l4/l5/l6  --> watch triggered, getData, NoNodeException
      

      The only difference is that all paths up to including l5 do not actually
      have any data. Only l6 has some data. Could there be some latency issues?

      For completeness, the embedded server is started as follows.

      // disable LOG4J JMX stuff
      System.setProperty("zookeeper.jmx.log4j.disable", Boolean.TRUE.toString());
      
      // get directories
      final File dataDir = new File(config.getDataLogDir());
      final File snapDir = new File(config.getDataDir());
      
      // clean old logs
      PurgeTxnLog.purge(dataDir, snapDir, 3);
      
      // create standalone server
      zkServer = new ZooKeeperServer();
      zkServer.setTxnLogFactory(new FileTxnSnapLog(dataDir, snapDir));
      zkServer.setTickTime(config.getTickTime());
      zkServer.setMinSessionTimeout(config.getMinSessionTimeout());
      zkServer.setMaxSessionTimeout(config.getMaxSessionTimeout());
      
      factory = new NIOServerCnxn.Factory(config.getClientPortAddress(),
      config.getMaxClientCnxns());
      
      // start server
      LOG.info("Starting ZooKeeper standalone server.");
      try {
        factory.startup(zkServer);
      } catch (final InterruptedException e) {
        LOG.warn("Interrupted during server start.", e);
        Thread.currentThread().interrupt();
      }
      
      1. zookeeper-nonode-issue.log
        85 kB
        Gunnar Wagenknecht

        Activity

        Hide
        Gunnar Wagenknecht added a comment -

        I attached the complete DEBUG log of the procedure. This includes the behavior of our application start procedure. The exception is logged in line 190 WARN o.e.g.c.i.p.ZooKeeperBasedPreferences - Error refreshing node ....

        Show
        Gunnar Wagenknecht added a comment - I attached the complete DEBUG log of the procedure. This includes the behavior of our application start procedure. The exception is logged in line 190 WARN o.e.g.c.i.p.ZooKeeperBasedPreferences - Error refreshing node ... .
        Hide
        Gunnar Wagenknecht added a comment -
        Am 19.05.2011 21:14, schrieb Patrick Hunt:
        > Hi Gunnar this is great detective work. It certainly sounds like it
        > might be some timing issue or possible bug in ZK exposed by this
        > embedded case. A few questions:
        > 
        > 1) in this dev/embedded case you only have a single zk server, correct?
        

        Single ZK server.

        > 2) you have 2 clients in this case, one creating the znode and one
        > watching (vs say a single client doing both)
        

        It's a single client which connects to "localhost" single server.

        Show
        Gunnar Wagenknecht added a comment - Am 19.05.2011 21:14, schrieb Patrick Hunt: > Hi Gunnar this is great detective work. It certainly sounds like it > might be some timing issue or possible bug in ZK exposed by this > embedded case. A few questions: > > 1) in this dev/embedded case you only have a single zk server, correct? Single ZK server. > 2) you have 2 clients in this case, one creating the znode and one > watching (vs say a single client doing both) It's a single client which connects to "localhost" single server.
        Hide
        Patrick Hunt added a comment -

        Hi Gunnar, I believe I see something from the logs:

        1) goto line 190 in the log, time index 09:00:59.357
        2) search backward in the log file for the znode you are getting
        "nonode" on, this:
        /gyrex/prefs/cloud/org.eclipse.gyrex.http/applications/org.eclipse.gyrex.http.equinoxhttpservice.application
        3) time index is now 09:00:58.968 (the most recent earlier operation
        on this znode), notice the operation being performed on the znode -
        it's explicitly being deleted by the client.

        This (3) is the cause of the log message from (1).

        So from what I see in the log it looks like expected behavior from the
        ZK view of the world. Does this make sense to you or am I missing
        something?

        Show
        Patrick Hunt added a comment - Hi Gunnar, I believe I see something from the logs: 1) goto line 190 in the log, time index 09:00:59.357 2) search backward in the log file for the znode you are getting "nonode" on, this: /gyrex/prefs/cloud/org.eclipse.gyrex.http/applications/org.eclipse.gyrex.http.equinoxhttpservice.application 3) time index is now 09:00:58.968 (the most recent earlier operation on this znode), notice the operation being performed on the znode - it's explicitly being deleted by the client. This (3) is the cause of the log message from (1). So from what I see in the log it looks like expected behavior from the ZK view of the world. Does this make sense to you or am I missing something?
        Hide
        Gunnar Wagenknecht added a comment -

        Thanks very much. I didn't understand the logs properly. That help me a lot in setting the right breakpoint and understanding the issue further. It's within my application, well, sort of.

        Show
        Gunnar Wagenknecht added a comment - Thanks very much. I didn't understand the logs properly. That help me a lot in setting the right breakpoint and understanding the issue further. It's within my application, well, sort of.
        Hide
        Patrick Hunt added a comment -

        No worries (it's overwhelming sometimes, took me a bit to figure it out ). Good hunting and feel free to reach out if you run into another issue.

        Show
        Patrick Hunt added a comment - No worries (it's overwhelming sometimes, took me a bit to figure it out ). Good hunting and feel free to reach out if you run into another issue.

          People

          • Assignee:
            Unassigned
            Reporter:
            Gunnar Wagenknecht
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development