  1. Apache Helix
  2. HELIX-264

fix zkclient#close() bug

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • None
    • 0.6.2-incubating
    • None
    • None
    • Sprint #4 10/2 - 10/16

    Description

      When flapping is detected, we are running in the zkclient event thread, and we call zkclient.close() from that same thread. Here is ZkClient#close():

      public void close() throws ZkInterruptedException {
        if (_connection == null) {
          return;
        }
        LOG.debug("Closing ZkClient...");
        getEventLock().lock();
        try {
          setShutdownTrigger(true);
          _eventThread.interrupt();
          _eventThread.join(2000);
          _connection.close();
          _connection = null;
        } catch (InterruptedException e) {
          throw new ZkInterruptedException(e);
        } finally {
          getEventLock().unlock();
        }
        LOG.debug("Closing ZkClient...done");
      }

      _eventThread.interrupt(); <-- sets the interrupt status of _eventThread, which is in fact the current thread.
      _eventThread.join(2000); <-- throws InterruptedException immediately, because the current thread has just been interrupted.
      _connection.close(); <-- SKIPPED!!!

      So when flapping happens, we call ZkHelixManager#disconnectInternal(), which always interrupts ZkClient#_eventThread but never closes the zk connection. This is probably a zkclient bug: zkclient.close() should never be called from its own event thread context.
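The failure mode can be reproduced without ZooKeeper. The sketch below is only an illustration: FakeClient and SelfCloseDemo are hypothetical stand-ins (not Helix or zkclient classes) that copy just the interrupt/join/close ordering of ZkClient#close() and run it on the client's own event thread.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class SelfCloseDemo {
    /** Hypothetical stand-in for ZkClient; mirrors only the interrupt/join/close ordering. */
    static class FakeClient {
        final AtomicBoolean connectionClosed = new AtomicBoolean(false);
        volatile Thread eventThread;

        void close() throws InterruptedException {
            eventThread.interrupt();     // sets the interrupt flag of eventThread
            eventThread.join(2000);      // throws at once when eventThread is the caller
            connectionClosed.set(true);  // stands in for _connection.close()
        }
    }

    /** Calls close() from the client's own event thread; returns whether the connection was closed. */
    static boolean closeFromEventThread() throws InterruptedException {
        FakeClient client = new FakeClient();
        Thread t = new Thread(() -> {
            try {
                client.close();
            } catch (InterruptedException e) {
                // Same path as the catch block in ZkClient#close(): the close step never ran.
            }
        });
        client.eventThread = t;
        t.start();
        t.join();
        return client.connectionClosed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("connection closed: " + closeFromEventThread());
    }
}
```

Under these assumptions the method returns false: the self-interrupt makes join(2000) throw before the close step is reached.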

      Fix steps:
      1) work around this bug
      2) add test cases for flapping detection
      3) explore having the controller detect flapping participants and disable them (possibly by querying zk-server JMX metrics)
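One possible shape for the workaround in step 1, offered as an assumption rather than the fix that was actually committed: skip the interrupt/join when close() is invoked from the event thread itself, and release the connection in a finally block so it cannot be skipped. FakeClient and SafeCloseSketch are hypothetical names used only for this illustration.

```java
public class SafeCloseSketch {
    /** Hypothetical stand-in for ZkClient; not the real zkclient API. */
    static class FakeClient {
        volatile Thread eventThread;
        volatile boolean connectionClosed = false;

        void close() throws InterruptedException {
            try {
                // Assumed workaround: never interrupt or join the calling thread itself.
                if (Thread.currentThread() != eventThread) {
                    eventThread.interrupt();
                    eventThread.join(2000);
                }
            } finally {
                connectionClosed = true; // stands in for _connection.close(); always runs
            }
        }
    }

    /** Calls close() from the client's own event thread; returns whether the connection was closed. */
    static boolean closeFromEventThread() throws InterruptedException {
        FakeClient client = new FakeClient();
        Thread t = new Thread(() -> {
            try {
                client.close();
            } catch (InterruptedException e) {
                // Not expected on this path: the self-interrupt is skipped.
            }
        });
        client.eventThread = t;
        t.start();
        t.join();
        return client.connectionClosed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("connection closed: " + closeFromEventThread());
    }
}
```

In the real client the event thread would still need to stop on its own after the handler returns, for example by checking the shutdown trigger that close() already sets; the sketch does not model that part.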

          People

            dafu Zhen Zhang
            dafu Zhen Zhang