Uploaded image for project: 'Kafka'
  1. Kafka
  2. KAFKA-2891 Gaps in messages delivered by new consumer after Kafka restart
  3. KAFKA-2908

Consumer Sporadically Stops Consuming From Partition After Server Restart

    XMLWordPrintableJSON

Details

    • Sub-task
    • Status: Resolved
    • Major
    • Resolution: Cannot Reproduce
    • None
    • None
    • consumer
    • None

    Description

      Context:
      Instance of the rolling upgrade test. 10s sleeps have been put around node restarts to ensure stability when subsequent nodes go down. Consumer timeout has been set to 60s. Proudcer has been throttled to 100 messages per second.

      Failure is rare: It occurred once in 60 executions (6 executions per run #276 -> #285 of the system_test_branch_builder)

      Reported Failure:

      At least one acked message did not appear in the consumed messages. acked_minus_consumed: 16385, 16388, 16391, 16394, 16397, 16400, 16403, 16406, 16409, 16412, 16415, 16418, 16421, 16424, 16427, 16430, 16433, 16436, 16439, 16442, ...plus 1669 more

      Immediate Observations:

      • The list of messages not consumed are all in partition 1.
      • Production and Consumption continues throughout the test (there is no complete write or read failure as we have seen elsewhere)
      • The messages ARE present in the data files:
        e.g.

        Examining missing value 16385. This was written to p1,offset:5453
        => there is an entry in all data files for partition 1 for this (presumably meaning it was replicated correctly)

        kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files worker10/kafka-data-logs/test_topic-1/00000000000000000000.log | grep 'offset: 5453'

        worker10,9,8:
        offset: 5453 position: 165346 isvalid: true payloadsize: 5 magic: 0 compresscodec: NoCompressionCodec crc: 1502953075 payload: 16385

        => ID 16385 is definitely present in the Kafka logs suggesting a problem consumer-side

      These entries definitely do not appear in the consumer stdout file either.

      Timeline:

      15:27:02,232 - Producer sends first message
      ..servers 1, 2 are restarted (clean shutdown)
      15:29:42,718 - Server 3 shutdown complete
      15:29:42,712 - (Controller fails over): Broker 2 starting become controller state transition (kafka.controller.KafkaController)
      15:29:42,743 - New leasder is 2 (LeaderChangeListner)
      15:29:43,239 - WARN Broker 2 ignoring LeaderAndIsr request from controller 2 with correlation id 0 epoch 7 for partition (test_topic,1) since its associated leader epoch 8 is old. Current leader epoch is 8 (state.change.logger)
      15:29:45,642 - Producer starts writing messages that are never consumed
      15:30:10,804 - Last message sent by producer
      15:31:10,983 - Consumer times out after 60s wait without messages

      Logs for this run are attached

      Attachments

        1. 2015-11-28--001.tar.gz
          1.67 MB
          Ben Stopford

        Activity

          People

            hachikuji Jason Gustafson
            benstopford Ben Stopford
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: