Uploaded image for project: 'Samza'
  1. Samza
  2. SAMZA-1069

Deadlock between KafkaSystemProducer and KafkaProducer from kafka-clients lib

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.11.0
    • Fix Version/s: 0.12.0
    • Component/s: None
    • Labels:
      None

      Description

      We have identified one deadlock scenario between the main thread that calls KafkaSystemProducer.close() vs the KafkaProducer client lib's network thread that calls the callback function within KafkaSystemProducer.send().

      The scenario is the following:

      1. SamzaContainer main thread caught an exception from previous commit and container initiated shutdown, which calls KafkaSystemProducer.stop(), grabbing the synchronized producerLock in KafkaSystemProducer and call KafkaProducer.flush() to wait for all pending requests to be done.
      2. KafkaProducer network I/O thread then calls KafkaSystemProducer’s callback function (in RecordBatch.done()), which is waiting on the same producerLock in KafkaSystemProducer before it can return and call producerFuture.done() and release the CountDownLatch that the main thread KafkaSystemProducer.close() is waiting on. Hence, deadlock!

      We need to make sure the KafkaSystemProducer.close() won't have race condition w/ the callbacks triggered by the KafkaProducer's network thread.

        Issue Links

          Activity

          Hide
          nickpan47 Yi Pan (Data Infrastructure) added a comment -

          Merged and submitted. Thanks!

          Show
          nickpan47 Yi Pan (Data Infrastructure) added a comment - Merged and submitted. Thanks!
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/samza/pull/37

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/samza/pull/37
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user xinyuiscool opened a pull request:

          https://github.com/apache/samza/pull/37

          SAMZA-1069: Fix Deadlock between KafkaSystemProducer and KafkaProducer

          Moving the producer.close() and sources.flush() outside the lock so it won't have race condition with the kafka network thread callbacks.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/xinyuiscool/samza SAMZA-1069

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/samza/pull/37.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #37


          commit 7dad5ad19df01911c109d1a6f3667558eda4d8e0
          Author: Xinyu Liu <xiliu@xiliu-ld.linkedin.biz>
          Date: 2016-12-23T20:21:34Z

          SAMZA-1069: Fix Deadlock between KafkaSystemProducer and KafkaProducer


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user xinyuiscool opened a pull request: https://github.com/apache/samza/pull/37 SAMZA-1069 : Fix Deadlock between KafkaSystemProducer and KafkaProducer Moving the producer.close() and sources.flush() outside the lock so it won't have race condition with the kafka network thread callbacks. You can merge this pull request into a Git repository by running: $ git pull https://github.com/xinyuiscool/samza SAMZA-1069 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/samza/pull/37.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #37 commit 7dad5ad19df01911c109d1a6f3667558eda4d8e0 Author: Xinyu Liu <xiliu@xiliu-ld.linkedin.biz> Date: 2016-12-23T20:21:34Z SAMZA-1069 : Fix Deadlock between KafkaSystemProducer and KafkaProducer
          Hide
          xinyu Xinyu Liu added a comment -

          Looks like the flush() is thread safe by itself, so there is no reason why it needs to hold the producerLock. By moving it outside should resolve this issue. I will put up a patch. Thanks for the thorough investigation!

          Show
          xinyu Xinyu Liu added a comment - Looks like the flush() is thread safe by itself, so there is no reason why it needs to hold the producerLock. By moving it outside should resolve this issue. I will put up a patch. Thanks for the thorough investigation!

            People

            • Assignee:
              xinyu Xinyu Liu
              Reporter:
              nickpan47 Yi Pan (Data Infrastructure)
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development