Uploaded image for project: 'Kafka'
  1. Kafka
  2. KAFKA-4384

ReplicaFetcherThread stopped after ReplicaFetcherThread received a corrupted message

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0.1, 0.10.0.0, 0.10.0.1, 0.10.1.0
    • Fix Version/s: 0.10.1.1, 0.10.2.0
    • Component/s: core
    • Labels:
      None
    • Environment:
      Ubuntu 12.04, AWS D2 instance

      Description

      We recently discovered an issue in Kafka 0.9.0.1 (), where ReplicaFetcherThread stopped after ReplicaFetcherThread received a corrupted message. As the same logic exists also in Kafka 0.10.0.0 and 0.10.0.1, they may have the similar issue.

      Here are system logs related to this issue.
      2016-10-30 - 02:33:05,606 ERROR ReplicaFetcherThread-5-1590174474 ReplicaFetcherThread.apply - Found invalid messages during fetch for partition [logs,41] offset 39021512238 error Message is corrupt (stored crc = 2028421553, computed crc = 3577227678)
      2016-10-30 - 02:33:06,582 ERROR ReplicaFetcherThread-5-1590174474 ReplicaFetcherThread.error - [ReplicaFetcherThread-5-1590174474], Error due to kafka.common.KafkaException: - error processing data for partition [logs,41] offset 39021512301 Caused - by: java.lang.RuntimeException: Offset mismatch: fetched offset = 39021512301, log end offset = 39021512238.

      First, ReplicaFetcherThread got a corrupted message (offset 39021512238) due to some blip.

      Line https://github.com/apache/kafka/blob/0.9.0.1/core/src/main/scala/kafka/server/AbstractFetcherThread.scala#L138 threw exception

      Then, Line https://github.com/apache/kafka/blob/0.9.0.1/core/src/main/scala/kafka/server/AbstractFetcherThread.scala#L145 caught it and logged this error.

      Because https://github.com/apache/kafka/blob/0.9.0.1/core/src/main/scala/kafka/server/AbstractFetcherThread.scala#L134 updated the topic partition offset to the fetched latest one in partitionMap. So ReplicaFetcherThread skipped the batch with corrupted messages.

      Based on https://github.com/apache/kafka/blob/0.9.0.1/core/src/main/scala/kafka/server/AbstractFetcherThread.scala#L84, the ReplicaFetcherThread then directly fetched the next batch of messages (with offset 39021512301)

      Next, ReplicaFetcherThread stopped because the log end offset (still 39021512238) didn't match the fetched message (offset 39021512301).

      A quick fix is to move line 134 to be after line 138.

      Would be great to have your comments and please let me know if a Jira issue is needed. Thanks.

        Attachments

        1. KAFKA-4384.patch
          6 kB
          Jun He

          Issue Links

            Activity

              People

              • Assignee:
                junhe Jun He
                Reporter:
                junhe Jun He
                Reviewer:
                Jiangjie Qin
              • Votes:
                0 Vote for this issue
                Watchers:
                5 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: