Kafka / KAFKA-14312

KRaft + ProducerStateManager: produce requests to new partitions with a non-zero sequence number should be rejected


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: kraft, producer
    • Labels: None

    Description

      Background

      In KRaft mode, after I create a topic, I occasionally receive a MetadataResponse that names a valid leader, yet an immediate produce request to that topic fails with NOT_LEADER_FOR_PARTITION. There may be a separate bug that causes KRaft to return a leader in metadata while that leader still rejects requests, but this behavior exposes a bigger problem.

      Kafka currently accepts produce requests to new partitions with a non-zero sequence number. I confirmed this locally by modifying my client to start producing at sequence number 10: three records produced back to back (sequences 10, 11, 12) are all accepted. I think this comment in the Kafka source also indicates roughly the same thing.

      Problem

      • Client initializes its producer ID
      • Client creates topic "foo" (for this problem we can ignore partitioning: the topic has a single partition)
      • Client sends produce request A with 5 records (sequences 0 to 4)
      • Client sends produce request B with 5 records (sequences 5 to 9) before receiving a response for A
      • Broker returns NOT_LEADER_FOR_PARTITION for produce request A
      • Broker finishes initializing and becomes leader before it sees request B
      • Broker accepts request B as the first request from this producer
      • Broker treats sequence number 5 as valid and now expects the next sequence to be 10
      • Client retries requests A and B, because A failed
      • Broker sees request A with sequence 0 and returns OutOfOrderSequenceException
      • Client enters a fatal state, because OutOfOrderSequenceException (OOOSN) is not retryable
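
The steps above can be replayed against a toy model of the broker's per-partition sequence check. This is only a sketch for illustration, not Kafka's ProducerStateManager: the class ToyPartition, the function run_scenario, and the require_zero_first_seq switch are hypothetical names, duplicate detection is reduced to remembering the last accepted batch, and the choice of OUT_OF_ORDER_SEQUENCE_NUMBER as the error for a rejected first batch is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

NONE = "NONE"
NOT_LEADER = "NOT_LEADER_FOR_PARTITION"
OUT_OF_ORDER = "OUT_OF_ORDER_SEQUENCE_NUMBER"


@dataclass
class ToyPartition:
    """Hypothetical, heavily simplified stand-in for one partition's producer state."""
    is_leader: bool = False
    # Assumed switch modelling the proposed behavior: a producer's first
    # batch on a partition must start at sequence 0.
    require_zero_first_seq: bool = False
    # producer ID -> (first, last) sequence of the last accepted batch
    last_batch: Dict[int, Tuple[int, int]] = field(default_factory=dict)

    def produce(self, producer_id: int, first_seq: int, count: int) -> str:
        if not self.is_leader:
            return NOT_LEADER
        last_seq = first_seq + count - 1
        prev = self.last_batch.get(producer_id)
        if prev is None:
            # Unknown producer: the behavior reported here accepts any
            # starting sequence unless the assumed strict switch is on.
            if self.require_zero_first_seq and first_seq != 0:
                return OUT_OF_ORDER
        elif (first_seq, last_seq) == prev:
            # Exact retry of the last accepted batch: acknowledge it again.
            return NONE
        elif first_seq != prev[1] + 1:
            return OUT_OF_ORDER
        self.last_batch[producer_id] = (first_seq, last_seq)
        return NONE


def run_scenario(require_zero_first_seq: bool) -> List[Tuple[str, str]]:
    """Replay requests A (seq 0-4) and B (seq 5-9) in the order described above."""
    partition = ToyPartition(require_zero_first_seq=require_zero_first_seq)
    pid = 1
    results = []
    # A arrives before the broker has become leader.
    results.append(("A", partition.produce(pid, 0, 5)))
    # The broker becomes leader, then sees B.
    partition.is_leader = True
    results.append(("B", partition.produce(pid, 5, 5)))
    # The client retries both requests with their original sequences.
    results.append(("A retry", partition.produce(pid, 0, 5)))
    results.append(("B retry", partition.produce(pid, 5, 5)))
    return results


if __name__ == "__main__":
    for strict in (False, True):
        print("require_zero_first_seq =", strict)
        for name, error in run_scenario(strict):
            print("  ", name, "->", error)
```

With the switch off, the model accepts B first and then rejects the retry of A, which is the fatal case described above. With the switch on, B is rejected while the partition has no state for the producer, so the retried A and B are appended in order.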

      Reproducing

      I can reliably reproduce this error in KRaft mode with a single broker, using the following Docker Compose file:

      version: "3.7"
      services:
        kafka:
          image: bitnami/kafka:latest
          network_mode: host
          environment:
            KAFKA_ENABLE_KRAFT: "yes"
            KAFKA_CFG_PROCESS_ROLES: controller,broker
            KAFKA_CFG_CONTROLLER_LISTENER_NAMES: CONTROLLER
            KAFKA_CFG_LISTENERS: PLAINTEXT://:9092,CONTROLLER://:9093
            KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
            KAFKA_CFG_CONTROLLER_QUORUM_VOTERS: 1@127.0.0.1:9093
            # Set this to "PLAINTEXT://127.0.0.1:9092" if you want to run this container on localhost via Docker
            KAFKA_CFG_ADVERTISED_LISTENERS: PLAINTEXT://127.0.0.1:9092
            KAFKA_CFG_BROKER_ID: 1
            ALLOW_PLAINTEXT_LISTENER: "yes"
            KAFKA_KRAFT_CLUSTER_ID: XkpGZQ27R3eTl3OdTm2LYA # 16 byte base64-encoded UUID
            BITNAMI_DEBUG: "true" # Enable this to get more info on startup failures

       

      I trigger this by running the franz-go integration tests, which hit it frequently but not on every run. The tests are not required, though: the sequence of requests described above can occasionally reproduce the error on its own.

      I have never experienced this against the ZooKeeper version. The ZooKeeper version seems to fully initialize a topic immediately and does not return NOT_LEADER_FOR_PARTITION on the first produce request. That is a separate problem. The main problem described above exists in all versions, and can be hit with ZooKeeper in very unusual circumstances.

          People

            Assignee: Unassigned
            Reporter: Travis Bischel (twmb)