KAFKA-860: Replica fetcher thread errors out and dies during rolling bounce of cluster

    Details

• Type: Bug
    • Status: Closed
• Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.8.0
    • Fix Version/s: None
    • Component/s: replication
    • Labels:

      Description

      2013/04/10 20:04:32.071 ERROR [ReplicaFetcherThread] [ReplicaFetcherThread-0-272] [kafka] [] [ReplicaFetcherThread-0-272], Error due to
      kafka.common.KafkaException: error processing data for topic PageViewEvent partititon 3 offset 2482625623
      at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$4.apply(AbstractFetcherThread.scala:135)
      at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$4.apply(AbstractFetcherThread.scala:113)
      at scala.collection.immutable.Map$Map1.foreach(Map.scala:105)
      at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:113)
      at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:89)
      at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
      Caused by: java.lang.RuntimeException: Offset mismatch: fetched offset = 2482625623, log end offset = 2482625631.
      at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:49)
      at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$4.apply(AbstractFetcherThread.scala:132)
      ... 5 more

This causes the replica fetcher thread to shut down.

Attachments

1. kafka-860-v2.patch (10 kB, Neha Narkhede)
2. kafka-860-v1.patch (5 kB, Neha Narkhede)

        Issue Links

          Activity

          Neha Narkhede added a comment -

This is caused by a race condition between the old leader's local append to the log and the new follower's log truncation. Specifically, the following sequence causes the bug:

1. The current leader receives a produce request.
2. The broker receives a leader-and-ISR request that makes it a follower.
3. The broker starts the become-follower transition and truncates its log.
4. The broker, not yet aware that it is no longer the leader, continues with the produce request and appends data to the log.
5. The become-follower transition starts a fetcher with the old log end offset.

At step 5, the fetcher runs into the error above; a sketch of the interleaving follows.
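For illustration only, here is a minimal Scala sketch of that interleaving, with hypothetical names that do not match the actual broker code: the leader check and the append are not protected by a common lock, so a become-follower truncation can run between them and the stale append then moves the log end offset past the offset the new fetcher was started with.

    // Hypothetical sketch of the racy produce path; names are illustrative only.
    class RacySketch {
      @volatile private var isLeader = true
      private var logEndOffset = 2482625623L

      // Steps 2-3: a leader-and-ISR request arrives, the broker becomes a follower
      // and truncates its log.
      def becomeFollower(highWatermark: Long): Unit = synchronized {
        isLeader = false
        logEndOffset = highWatermark
      }

      // Steps 1 and 4: the produce path checks leadership and appends without holding
      // the same lock, so becomeFollower() can interleave between check and append.
      def handleProduce(messageCount: Int): Unit = {
        if (isLeader) {
          // ... becomeFollower() may run right here ...
          logEndOffset += messageCount   // stale append advances the log end offset
        }
      }
    }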

          Neha Narkhede added a comment -

The root cause is that during produce request handling, the check that the broker is the leader and the message append are done under different locks, so the two steps are not atomic. The fix is to move the append into Partition, so that either the broker is still the leader and the append completes, or the produce request is rejected because the broker is becoming a follower. No interleaving should be possible. A sketch of the idea follows.
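A minimal sketch of that approach, assuming illustrative method names rather than the committed patch: the leader check and the append share one lock inside Partition, so a become-follower transition can no longer slip in between them.

    // Sketch only; names are illustrative, not the actual Partition code.
    class PartitionSketch {
      private val leaderIsrUpdateLock = new Object
      private var isLeader = true

      def appendMessagesToLeader(messages: Seq[Array[Byte]]): Unit =
        leaderIsrUpdateLock.synchronized {
          if (!isLeader)
            throw new IllegalStateException("Not the leader; rejecting the produce request")
          appendToLocalLog(messages)   // runs under the same lock as the leader check
        }

      def makeFollower(): Unit =
        leaderIsrUpdateLock.synchronized {
          isLeader = false
          truncateLog()                // no append can interleave with the truncation
        }

      private def appendToLocalLog(messages: Seq[Array[Byte]]): Unit = ()
      private def truncateLog(): Unit = ()
    }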

          Neha Narkhede added a comment -

Jun made a good point when we discussed this offline. The solution is correct, but it comes with a performance hit. The only real requirement is that become-leader, become-follower, and ISR updates block appends; two appends should not block each other. I implemented that using a read-write lock, roughly as sketched below.
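A rough sketch of the read-write-lock variant (illustrative names, not the v2 patch itself): appends take the read lock so they do not block one another, while become-leader, become-follower, and ISR updates take the write lock and exclude all appends.

    import java.util.concurrent.locks.ReentrantReadWriteLock

    // Sketch only; not the actual patch.
    class PartitionRwSketch {
      private val leaderIsrUpdateLock = new ReentrantReadWriteLock()
      private var isLeader = true

      def appendMessagesToLeader(messages: Seq[Array[Byte]]): Unit = {
        val readLock = leaderIsrUpdateLock.readLock()
        readLock.lock()
        try {
          if (!isLeader) throw new IllegalStateException("Not the leader")
          appendToLocalLog(messages)   // concurrent appends all hold the read lock
        } finally readLock.unlock()
      }

      def makeFollower(): Unit = {
        val writeLock = leaderIsrUpdateLock.writeLock()
        writeLock.lock()
        try {
          isLeader = false
          truncateLog()                // exclusive: waits until all appends release the read lock
        } finally writeLock.unlock()
      }

      private def appendToLocalLog(messages: Seq[Array[Byte]]): Unit = ()
      private def truncateLog(): Unit = ()
    }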

          Jun Rao added a comment -

          Thanks for patch v2. This turns out to be a bit more tricky.

          1. First of all, instead of using "leaderIsrReadLock synchronized", we should do "leaderIsrReadLock.lock()".

          2. Second, we should use a fair readWriteLock. Otherwise, some threads may be indefinitely postponed.

3. Third, per the Javadoc, ReentrantReadWriteLock doesn't support upgrading from a read lock to a write lock:
          " Lock downgrading
          Reentrancy also allows downgrading from the write lock to a read lock, by acquiring the write lock, then the read lock and then releasing the write lock. However, upgrading from a read lock to the write lock is not possible. "

This means that if we need to call updateIsr(), we have to first release the read lock and reacquire it when done. See the following example. However, this means that we are still vulnerable to the issue in maybeIncrementLeaderHW() (KAFKA-862). We can probably change the logic in maybeIncrementLeaderHW() so that it can handle an empty set. We will need to think a bit more about how to write the logic in a clean way.

          http://codereview.stackexchange.com/questions/12939/reentrantreadwritelock-lock-upgrade-method

Another possibility is to just take the v1 patch. All producers to the same log will sync on the leaderIsrUpdateLock. In log.append(), the only calls outside the log lock are analyzeAndValidateMessageSet() and maybeFlush(). The former is cheap since it does shallow iteration. The latter reacquires the log lock if a flush is needed.
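A small Scala sketch of points 2 and 3 above (an assumption-laden illustration, not Kafka code): the lock is constructed in fair mode, and since ReentrantReadWriteLock cannot upgrade a read lock to a write lock, a caller already holding the read lock has to release it, take the write lock, and then downgrade back to the read lock when done.

    import java.util.concurrent.locks.ReentrantReadWriteLock

    object FairLockSketch {
      // fair = true, so waiting writers are not postponed indefinitely
      private val leaderIsrUpdateLock = new ReentrantReadWriteLock(true)

      // Assumes the caller already holds the read lock; upgrading in place would deadlock.
      def updateIsrUnderWriteLock(update: () => Unit): Unit = {
        leaderIsrUpdateLock.readLock().unlock()   // give up the read lock first
        leaderIsrUpdateLock.writeLock().lock()
        try {
          update()                                // e.g. shrink or expand the ISR
        } finally {
          leaderIsrUpdateLock.readLock().lock()   // downgrade: reacquire read before releasing write
          leaderIsrUpdateLock.writeLock().unlock()
        }
      }
    }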

          Neha Narkhede added a comment -

          I like the option of picking a simpler solution for now and filing a performance improvement bug to come back and do it properly.

          Jay Kreps added a comment -

The reason for the more granular locking in Log was to avoid locking around the flush. However, we have since learned that the flush effectively locks the log no matter what, so it doesn't make any difference. So I am not sure that v1 will be a performance hit. What would be really nice would be automated perf tests that check this kind of thing so we could spot regressions.

          Neha Narkhede added a comment -

          I checked in patch v1 and will see how that goes.

          Neha Narkhede added a comment -

          v1 fixes the issue


            People

            • Assignee:
              Neha Narkhede
              Reporter:
              Neha Narkhede
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development