KAFKA-989

Race condition shutting down high-level consumer results in spinning background thread

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.0
    • Fix Version/s: 0.8.0
    • Component/s: None
    • Labels: None
    • Environment: Ubuntu Linux x64

      Description

      An application that uses the Kafka client under load can often hit this issue within a few hours.

      High-level consumers come and go over this application's lifecycle, but there are a variety of defenses that ensure each high-level consumer lasts several seconds before being shut down. Nevertheless, some race is causing this background thread to continue long after the ZkClient it is using has been disconnected. Since the thread was spawned by a consumer that has already been shut down, the application has no way to find the thread and stop it.

      Reported on the users-kafka mailing list 6/25 as "0.8 throwing exception 'Failed to find leader' and high-level consumer fails to make progress".

      The only remedy is to shut down the application and restart it. Externally detecting that this state has occurred is not pleasant: you need to grep the log for repeated occurrences of the same exception.

      Stack trace:

      Failed to find leader for Set([topic6,0]): java.lang.NullPointerException
      at org.I0Itec.zkclient.ZkClient$2.call(ZkClient.java:416)
      at org.I0Itec.zkclient.ZkClient$2.call(ZkClient.java:413)
      at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)
      at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:413)
      at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:409)
      at kafka.utils.ZkUtils$.getChildrenParentMayNotExist(ZkUtils.scala:438)
      at kafka.utils.ZkUtils$.getAllBrokersInCluster(ZkUtils.scala:75)
      at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:63)
      at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)

        Activity

        Phil Hargett added a comment -

        This patch may minimize the issue, as there does seem to be a race between startConnections / stopConnections in ConsumerFetcherManager and the doWork method of the inner LeaderFinderThread class.

        It seems that a thread could be started (from startConnections) but shut down (from stopConnections) before the leader finder thread even started to do work.
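
        For illustration, here is a minimal sketch of that kind of interleaving, using invented names (this is not the actual ConsumerFetcherManager code): if stopConnections runs in the window before startConnections publishes the new thread, the shutdown signal is lost and the thread keeps running.

        class FetcherManagerSketch {
          @volatile private var finderThread: Option[Thread] = None

          def startConnections(): Unit = {
            val t = new Thread(new Runnable {
              def run(): Unit =
                try {
                  while (!Thread.currentThread().isInterrupted)
                    Thread.sleep(200) // stand-in for doWork(), which would use the shared ZkClient
                } catch { case _: InterruptedException => () } // exit once interrupted
            }, "leader-finder-sketch")
            // window: if stopConnections() runs here it sees finderThread == None,
            // has nothing to stop, and the thread started below is never shut down
            finderThread = Some(t)
            t.start()
          }

          def stopConnections(): Unit = {
            finderThread.foreach(_.interrupt())
            finderThread = None
          }
        }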

        Phil Hargett added a comment - edited

        Here's the patch: KAFKA-989-failed-to-find-leader.patch.

        Phil Hargett added a comment -

        Not good enough. It deadlocks because ShutdownableThread.shutdown grabs another lock.

        Phil Hargett added a comment -

        When in doubt about how to fix a locking issue...add another lock.

        While the real race here involves startConnections / stopConnections in ConsumerFetcherManager, the real trigger for such races appears to be the lack of protection in the shutdown and rebalance operations on ZookeeperConsumerConnector. There is nothing to prevent a rebalance while a shutdown is in progress, and it would appear that could trigger the race in ConsumerFetcherManager.

        The patch I'm attaching (see KAFKA-989-failed-to-find-leader-patch2.patch) adds a shutdown lock that is grabbed first in both shutdown() and in the run method of the ZKRebalancerListener. This should prevent a rebalance from happening on a consumer that has already shut down, keep the fetcher and the zkclient out of intermediate states, and thus prevent the race.
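
        A rough sketch of the shape of that change, with invented names (a simplification, not the attached patch itself): both shutdown() and the rebalance path take the same lock first, so a rebalance cannot restart fetchers on a connector that is already shutting down.

        import java.util.concurrent.locks.ReentrantLock

        class ConsumerConnectorSketch {
          private val shutdownLock = new ReentrantLock()
          @volatile private var isShutdown = false

          def shutdown(): Unit = {
            shutdownLock.lock()
            try {
              isShutdown = true
              // stop fetchers, then close the zkclient
            } finally shutdownLock.unlock()
          }

          // stands in for ZKRebalancerListener.run / a triggered rebalance
          def onRebalance(): Unit = {
            shutdownLock.lock()
            try {
              if (!isShutdown) {
                // safe to (re)start fetchers: shutdown() cannot interleave here
              }
            } finally shutdownLock.unlock()
          }
        }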

        Jun Rao added a comment -

        Thanks for the patch. I think it addresses one particular issue: When the consumer connector is shut down, there could still be an outstanding rebalance that uses zkclient which is already set to null. I am not sure if it addresses the problem that you hit though. Some comments:

        1. Since syncedRebalance() is called in multiple places, the shutdown lock should be checked inside syncedRebalance(). Since there is already a rebalanceLock in syncedRebalance(), perhaps shutdown() can just synchronize on that.

        Phil Hargett added a comment -

        Thanks for the feedback! Working on a patch that incorporates your suggestions now.

        FWIW, I think this will indirectly address my original situation. I think the reason the original stack trace occurred is that a rebalance started a fetcher with a LeaderFinderThread while the consumer itself was in the process of being shut down. The LeaderFinderThread is left with a stale ZkClient and has no other way to know that it should shut down. Since the fetcher has already set its reference to the LeaderFinderThread to null, the running state of the thread will never be changed to shut the thread off.

        I'm stuck trying to find an acceptable solution in LeaderFinderThread; this change in ZookeeperConsumerConnector may have to do until then.

        Will post new patch once I have it coded and tested.
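
        As an aside, a minimal sketch of the spinning-thread scenario described above (heavily simplified; not the real ShutdownableThread or LeaderFinderThread): the loop exits only when someone flips its running flag, so once the owner has dropped its reference to the thread, nothing can ever stop it and it keeps retrying against a closed ZkClient.

        import java.util.concurrent.atomic.AtomicBoolean

        abstract class ShutdownableThreadSketch(name: String) extends Thread(name) {
          private val running = new AtomicBoolean(true)

          def doWork(): Unit // e.g. look up brokers via the shared ZkClient

          override def run(): Unit =
            while (running.get) {
              try doWork()
              catch { case t: Throwable => println(s"Failed to find leader: $t") } // log and retry
            }

          def shutdown(): Unit = running.set(false) // never called once the reference is lost
        }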

        Jun Rao added a comment -

        Hmm, in the shutdown logic of the consumer connector, we set zkclient to null last. So all fetchers and the leader finder thread should already have been stopped by the time zkclient is null.

        Phil Hargett added a comment -

        Yes, but my working hypothesis is that because there are at least two sets of races (in the consumer connector's syncedRebalance/shutdown, and in ConsumerFetcherManager's startConnections/stopConnections), it is actually possible to have a LeaderFinderThread still running that has not been shut down even though its consumer has, because a stopConnections call completed before a startConnections call finished. So there is a started leader finder thread, but its ZkClient has been closed.

        The key, I think, is that there is no guarantee that, while the consumer connector is shutting down, a rebalance event won't actually start up another leader finder thread (by starting fetchers again).

        I believe the race in ConsumerFetcherManager is unlikely to happen if the race in ZookeeperConsumerConnector is fixed instead. Thus I avoid fixing the harder race by fixing an easier one that may be its only trigger (at present).

        Phil Hargett added a comment -

        Changed to reuse the existing rebalanceLock in shutdown() rather than adding yet another lock. Modified both shutdown and syncedRebalance accordingly.
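
        A minimal sketch of that shape (simplified; apart from rebalanceLock the names are approximations, and this is not the committed patch): shutdown() and syncedRebalance() serialize on the same rebalanceLock, and the rebalance path bails out once the connector is shutting down. Reusing the existing monitor keeps the fix to a single lock rather than adding yet another one.

        class ZkConsumerConnectorSketch {
          private val rebalanceLock = new Object
          @volatile private var isShuttingDown = false

          def shutdown(): Unit = rebalanceLock.synchronized {
            isShuttingDown = true
            // stop the fetcher manager and fetchers, then close the ZkClient last
          }

          def syncedRebalance(): Unit = rebalanceLock.synchronized {
            if (!isShuttingDown) {
              // release partition ownership, recompute assignment, restart fetchers
            }
          }
        }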

        Phil Hargett added a comment -

        I also think I've convinced myself that, while the race in ConsumerFetcherManager is not ideal, the real resource to protect is not the leader finder thread but the shared ZkClient instance, which is managed by ZookeeperConsumerConnector, where these fixes are made.

        By reducing the races in the consumer connector, we're less likely to mismanage the ZkClient.

        Jun Rao added a comment -

        Thanks for patch v3. Committed to 0.8.

        Phil Hargett added a comment -

        Thank you!


          People

          • Assignee: Phil Hargett
          • Reporter: Phil Hargett
          • Votes: 0
          • Watchers: 2
