Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version: 0.11.0.0
    • Fix Version: 0.11.0.0
    • Labels: None

    Description

      I observed the following in the broker logs:

      [2017-06-01 04:10:36,664] ERROR [Replica Manager on Broker 1]: Error processing append operation on partition __transaction_state-37 (kafka.server.ReplicaManager)
      [2017-06-01 04:10:36,667] ERROR [TxnMarkerSenderThread-1]: Error due to (kafka.common.InterBrokerSendThread)
      java.lang.StackOverflowError
              at java.security.AccessController.doPrivileged(Native Method)
              at java.io.PrintWriter.<init>(PrintWriter.java:116)
              at java.io.PrintWriter.<init>(PrintWriter.java:100)
              at org.apache.log4j.DefaultThrowableRenderer.render(DefaultThrowableRenderer.java:58)
              at org.apache.log4j.spi.ThrowableInformation.getThrowableStrRep(ThrowableInformation.java:87)
              at org.apache.log4j.spi.LoggingEvent.getThrowableStrRep(LoggingEvent.java:413)
              at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:313)
              at org.apache.log4j.DailyRollingFileAppender.subAppend(DailyRollingFileAppender.java:369)
              at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
              at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
              at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
              at org.apache.log4j.Category.callAppenders(Category.java:206)
              at org.apache.log4j.Category.forcedLog(Category.java:391)
              at org.apache.log4j.Category.error(Category.java:322)
              at kafka.utils.Logging$class.error(Logging.scala:105)
              at kafka.server.ReplicaManager.error(ReplicaManager.scala:122)
              at kafka.server.ReplicaManager$$anonfun$appendToLocalLog$2.apply(ReplicaManager.scala:557)
              at kafka.server.ReplicaManager$$anonfun$appendToLocalLog$2.apply(ReplicaManager.scala:505)
              at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
              at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
              at scala.collection.immutable.Map$Map1.foreach(Map.scala:116)
              at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
              at scala.collection.AbstractTraversable.map(Traversable.scala:104)
              at kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:505)
              at kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:346)
              at kafka.coordinator.transaction.TransactionStateManager$$anonfun$appendTransactionToLog$1.apply$mcV$sp(TransactionStateManager.scala:589)
              at kafka.coordinator.transaction.TransactionStateManager$$anonfun$appendTransactionToLog$1.apply(TransactionStateManager.scala:570)
              at kafka.coordinator.transaction.TransactionStateManager$$anonfun$appendTransactionToLog$1.apply(TransactionStateManager.scala:570)
              at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:213)
              at kafka.utils.CoreUtils$.inReadLock(CoreUtils.scala:219)
              at kafka.coordinator.transaction.TransactionStateManager.appendTransactionToLog(TransactionStateManager.scala:564)
              at kafka.coordinator.transaction.TransactionMarkerChannelManager.kafka$coordinator$transaction$TransactionMarkerChannelManager$$retryAppendCallback$1(TransactionMarkerChannelManager.scala:225)
              at kafka.coordinator.transaction.TransactionMarkerChannelManager$$anonfun$kafka$coordinator$transaction$TransactionMarkerChannelManager$$retryAppendCallback$1$4.apply(TransactionMarkerChannelManager.scala:225)
              at kafka.coordinator.transaction.TransactionMarkerChannelManager$$anonfun$kafka$coordinator$transaction$TransactionMarkerChannelManager$$retryAppendCallback$1$4.apply(TransactionMarkerChannelManager.scala:225)
              at kafka.coordinator.transaction.TransactionStateManager.kafka$coordinator$transaction$TransactionStateManager$$updateCacheCallback$1(TransactionStateManager.scala:561)
              at kafka.coordinator.transaction.TransactionStateManager$$anonfun$appendTransactionToLog$1$$anonfun$apply$mcV$sp$4.apply(TransactionStateManager.scala:595)
              at kafka.coordinator.transaction.TransactionStateManager$$anonfun$appendTransactionToLog$1$$anonfun$apply$mcV$sp$4.apply(TransactionStateManager.scala:595)
              at kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:373)
       

      Attachments

        1. KAFKA-5357.tar.gz
          3.15 MB
          Apurva Mehta

        Activity

          apurva Apurva Mehta added a comment -

          The upshot of this ticket is that the commit of a transaction will simply get stuck, because if the markers can't be written, the transaction will stay in a `PREPARE_COMMIT` state, and the producer retries will constantly get `CONCURRENT_TRANSACTION` errors.

          apurva Apurva Mehta added a comment -

          The bigger problem is that once the InterBrokerSend thread dies, it is not started up again. So any open transactions will sit in their `PrepareXX` state forever.

          guozhang please let me know your thoughts on this hypothesis.

apurva Apurva Mehta added a comment - edited

The stack overflow error is essentially due to the fact that we are in a tight recursive loop when handling transaction marker write completion. Here is what happens:

1. We send the marker.
2. Upon success, we try to write the updated transaction metadata to the transaction log to move it from PrepareXX to CompletedXX state.
3. If this append fails, we make a recursive call to `appendTransactionToLog` with the same metadata update, ad infinitum.
4. When there are broker bounces and not enough replicas are available, this can happen in a tight loop for several tens of seconds, resulting in a stack overflow error.

One fix is to back off and retry rather than looping tightly – that is what the client does.
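The recursive-retry failure mode described above can be sketched in plain Java (an illustrative toy with made-up names, not Kafka's actual code): a callback that retries a failed append by calling itself adds a stack frame per failure and eventually throws `StackOverflowError`, while the same retry written as a loop runs at constant stack depth.

```java
// Illustrative sketch only (hypothetical class and method names, not Kafka's
// code): why a tight recursive retry overflows the stack while a loop does not.
public class RetrySketch {
    // Simulated append that fails until the failure budget is exhausted.
    static int failuresLeft;

    static boolean tryAppend() {
        failuresLeft--;
        return failuresLeft <= 0; // succeeds once the failures run out
    }

    // Recursive form, mirroring the retry-callback pattern: every failed
    // attempt adds a new stack frame.
    static void appendRecursive() {
        if (!tryAppend()) {
            appendRecursive(); // a long failure streak -> StackOverflowError
        }
    }

    // Loop form, conceptually the shape of the eventual fix: retries happen
    // at constant stack depth, and a backoff could be added inside the loop.
    static void appendWithLoop() {
        while (!tryAppend()) {
            // optionally back off here instead of spinning
        }
    }

    public static void main(String[] args) {
        failuresLeft = 1_000_000; // far deeper than a default JVM thread stack
        boolean overflowed = false;
        try {
            appendRecursive();
        } catch (StackOverflowError e) {
            overflowed = true;
        }
        System.out.println("recursive retry overflowed: " + overflowed);

        failuresLeft = 1_000_000;
        appendWithLoop(); // same failure count, completes normally
        System.out.println("loop retry completed");
    }
}
```

Note that even with a backoff added to the recursive form, each retry would still consume a frame; only the loop keeps the depth constant.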

          ijuma Ismael Juma added a comment -

Even if we do that, it's still possible to overflow the stack, right? Could we replace the recursion with a loop?

          ijuma Ismael Juma added a comment -

          The other option, which is better if possible, is to make the calls tail recursive. I suspect this may not be easy here although I haven't looked at the code.

          githubbot ASF GitHub Bot added a comment -

          GitHub user dguy opened a pull request:

          https://github.com/apache/kafka/pull/3192

          KAFKA-5357: StackOverFlow error in transaction coordinator

          Replace recursion in `TransactionMarkerChannelManager#appendToLogCallback` with loop

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/dguy/kafka kafka-5357

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/kafka/pull/3192.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3192


          commit 96af68bd058aae80c180c81f4f53bc00b59d833b
          Author: Damian Guy <damian.guy@gmail.com>
          Date: 2017-06-01T14:35:52Z

          use loop rather than recursion


          guozhang Guozhang Wang added a comment -

          Issue resolved by pull request 3192
          https://github.com/apache/kafka/pull/3192

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/kafka/pull/3192

          githubbot ASF GitHub Bot added a comment -

          GitHub user guozhangwang opened a pull request:

          https://github.com/apache/kafka/pull/3242

          KAFKA-5357 follow-up: Yammer metrics, not Kafka Metrics

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/guozhangwang/kafka K5357-yammer-metrics

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/kafka/pull/3242.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3242


          commit 7e3e872ee165035ec4a86722ade30d88bfd459bb
          Author: Guozhang Wang <wangguoz@gmail.com>
          Date: 2017-06-06T05:26:45Z

          replace Kafka metrics with Yammer metrics

          commit b5db596b0088056a03cf6ae494a6f863a7bef888
          Author: Guozhang Wang <wangguoz@gmail.com>
          Date: 2017-06-06T05:32:14Z

          minor fix


          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/kafka/pull/3242


          People

Assignee: damianguy Damian Guy
            Reporter: apurva Apurva Mehta
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated:
              Resolved: