[HELIX-264] fix zkclient#close() bug - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.6.2-incubating
Component/s: None
Labels: None
Sprint: Sprint #4 10/2 - 10/16

Description

When flapping is detected, we are running in the zkclient event thread, so we end up calling zkclient.close() from its own event thread. Here is ZkClient#close():

public void close() throws ZkInterruptedException {
    if (_connection == null) {
        return;
    }
    LOG.debug("Closing ZkClient...");
    getEventLock().lock();
    try {
        setShutdownTrigger(true);
        _eventThread.interrupt();
        _eventThread.join(2000);
        _connection.close();
        _connection = null;
    } catch (InterruptedException e) {
        throw new ZkInterruptedException(e);
    } finally {
        getEventLock().unlock();
    }
    LOG.debug("Closing ZkClient...done");
}

_eventThread.interrupt(); <-- sets the interrupt status of _eventThread, which is in fact the current thread.
_eventThread.join(2000); <-- throws InterruptedException because the current thread has been interrupted.
_connection.close(); <-- SKIPPED!!!

So whenever flapping happens, we call ZkHelixManager#disconnectInternal(), which always interrupts ZkClient#_eventThread but never closes the zk connection. This is probably a zkclient bug, and in any case we should never call zkclient.close() from its own event thread context.
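
The failure mode can be reproduced without ZooKeeper. The following is a minimal sketch, not zkclient code: SelfCloseDemo and closeFromOwnThread() are hypothetical names, and an AtomicBoolean stands in for _connection.close(). It only assumes the interrupt/join/close ordering quoted above.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the close sequence; not the real ZkClient class.
class SelfCloseDemo {

    // Runs the interrupt/join/close sequence on the "event thread" itself and
    // returns true only if the simulated connection close step was reached.
    static boolean closeFromOwnThread() {
        AtomicBoolean connectionClosed = new AtomicBoolean(false);

        Thread eventThread = new Thread(() -> {
            Thread self = Thread.currentThread();
            try {
                self.interrupt();            // like _eventThread.interrupt() on the current thread
                self.join(2000);             // throws at once: interrupt status is already set
                connectionClosed.set(true);  // stands in for _connection.close()
            } catch (InterruptedException e) {
                // ZkClient#close() rethrows at this point; the close step never ran.
            }
        }, "demo-event-thread");

        eventThread.start();
        try {
            eventThread.join();              // wait for the demo thread to finish
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return connectionClosed.get();
    }

    public static void main(String[] args) {
        // prints "connection closed: false"
        System.out.println("connection closed: " + closeFromOwnThread());
    }
}
```

Because join() checks the caller's interrupt status, the 2000 ms timeout is never actually waited on in this scenario; the exception is raised immediately.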

Fix steps:
1) Work around this bug.
2) Add test cases for flapping detection.
3) Explore having the controller detect flapping participants and disable them (possibly by querying zk-server JMX metrics).
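
For step 1, one possible shape of a workaround is sketched below. This is an assumption about how it could be done, not necessarily the change that resolved this issue: FakeConnection, ClosableClient and WorkaroundDemo are hypothetical stand-ins, and the idea is simply to still close the connection when the join is interrupted, then restore the interrupt status.

```java
// Hypothetical stand-ins; the names are illustrative, not the zkclient or Helix API.
class FakeConnection {
    volatile boolean closed = false;

    void close() {
        closed = true;
    }
}

class ClosableClient {
    private final FakeConnection connection = new FakeConnection();
    private volatile Thread eventThread;

    void setEventThread(Thread t) {
        eventThread = t;
    }

    boolean isConnectionClosed() {
        return connection.closed;
    }

    // Same interrupt/join/close order as in the description, plus a fallback path.
    void close() {
        Thread t = eventThread;
        try {
            t.interrupt();
            t.join(2000);
            connection.close();
        } catch (InterruptedException e) {
            // Reached when close() runs on the event thread itself.
            // The exception already cleared the flag; this extra clear is defensive,
            // in case a wrapper exception re-set it before we got here.
            Thread.interrupted();
            connection.close();                  // the step that was skipped before
            Thread.currentThread().interrupt();  // restore the status for the caller
        }
    }
}

class WorkaroundDemo {

    // Returns true if the connection ends up closed when close() runs on the event thread.
    static boolean closeFromEventThread() {
        ClosableClient client = new ClosableClient();
        Thread eventThread = new Thread(client::close, "demo-event-thread");
        client.setEventThread(eventThread);
        eventThread.start();
        try {
            eventThread.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return client.isConnectionClosed();
    }

    public static void main(String[] args) {
        // prints "connection closed: true"
        System.out.println("connection closed: " + closeFromEventThread());
    }
}
```

An alternative would be to hand the close() call off to a separate thread so that it never runs in the event thread context; the sketch above keeps the call site unchanged instead.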

People

Assignee: Zhen Zhang

Reporter: Zhen Zhang

Votes: 0

Watchers: 2

Dates

Created: 02/Oct/13 18:22

Updated: 12/Nov/13 23:40

Resolved: 09/Oct/13 19:47