  1. Kafka
  2. KAFKA-5610

KafkaApis.handleWriteTxnMarkerRequest can return UNSUPPORTED_FOR_MESSAGE_FORMAT error on partition emigration


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.11.0.0
    • Fix Version/s: 0.11.0.1, 1.0.0
    • Component/s: None
    • Labels:

      Description

      This bug was revealed by the following system test failure: http://confluent-systest.s3-website-us-west-2.amazonaws.com/confluent-kafka-system-test-results/?prefix=2017-07-18--001.1500383975--apache--trunk--28c83d9/

      A commit marker for the offsets topic (written as part of the producer.sendOffsetsToTransaction method) was lost, so data was reprocessed and the test failed.

      The bug is that handleWriteTxnMarkersRequest returns the wrong error code when a partition emigrates. In particular, we have:

      for (marker <- markers.asScala) {
        val producerId = marker.producerId
        val (goodPartitions, partitionsWithIncorrectMessageFormat) = marker.partitions.asScala.partition { partition =>
          replicaManager.getMagic(partition) match {
            case Some(magic) if magic >= RecordBatch.MAGIC_VALUE_V2 => true
            case _ => false
          }
        }

        if (partitionsWithIncorrectMessageFormat.nonEmpty) {
          val currentErrors = new ConcurrentHashMap[TopicPartition, Errors]()
          partitionsWithIncorrectMessageFormat.foreach { partition => currentErrors.put(partition, Errors.UNSUPPORTED_FOR_MESSAGE_FORMAT) }
          updateErrors(producerId, currentErrors)
        }
      

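      However, replicaManager.getMagic() returns None once the partition has emigrated from this broker. None falls through to the "case _ => false" branch, so the partition is treated as having an incompatible message format and handleWriteTxnMarkersRequest returns UNSUPPORTED_FOR_MESSAGE_FORMAT, even though the real problem is that the broker is no longer the leader.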
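
      To make the conflation concrete, here is a minimal standalone sketch of the same predicate. It does not use the real Kafka classes: TopicPartition, getMagic, splitByFormat and the hosted map are hypothetical stand-ins introduced only for illustration, and the partition names other than __consumer_offsets-47 are made up.

```scala
// Standalone sketch with stand-in types; not the real Kafka classes.
object MagicCheckSketch {
  final case class TopicPartition(topic: String, partition: Int)

  // Mirrors RecordBatch.MAGIC_VALUE_V2 from the excerpt above.
  val MagicValueV2: Byte = 2

  // Hypothetical stand-in for replicaManager.getMagic: returns None when
  // this broker does not (or no longer does) host the partition.
  def getMagic(hosted: Map[TopicPartition, Byte])(tp: TopicPartition): Option[Byte] =
    hosted.get(tp)

  // Same predicate shape as in the excerpt above.
  def splitByFormat(
      hosted: Map[TopicPartition, Byte],
      partitions: Seq[TopicPartition]
  ): (Seq[TopicPartition], Seq[TopicPartition]) =
    partitions.partition { tp =>
      getMagic(hosted)(tp) match {
        case Some(magic) if magic >= MagicValueV2 => true
        case _ => false // matches Some(older magic) AND None
      }
    }

  val local = TopicPartition("__consumer_offsets", 3)      // hosted, new format
  val oldFormat = TopicPartition("legacy-topic", 0)        // hosted, old format
  val emigrated = TopicPartition("__consumer_offsets", 47) // no longer hosted

  val hosted: Map[TopicPartition, Byte] =
    Map(local -> 2.toByte, oldFormat -> 1.toByte)

  // The emigrated partition lands in the same bucket as the old-format one.
  val (good, bad) = splitByFormat(hosted, Seq(local, oldFormat, emigrated))
}
```

      In this sketch the emigrated partition ends up in the "incorrect message format" bucket alongside the genuinely old-format one, which is the misclassification described above.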

      From the log, we see that the partition did emigrate a few milliseconds before the WriteTxnMarkerRequest was sent.

      On the old leader, worker10:

      ./worker10/debug/server.log:32245:[2017-07-18 05:43:20,950] INFO [GroupCoordinator 2]: Unloading group metadata for transactions-test-consumer-group with generation 0 (kafka.coordinator.group.GroupCoordinator)
      

      On the client:

      [2017-07-18 05:43:20,959] INFO [Transaction Marker Request Completion Handler 1]: Sending my-first-transactional-id's transaction marker from partition __consumer_offsets-47 has failed with  UNSUPPORTED_FOR_MESSAGE_FORMAT. This partition will be removed from the set of partitions waiting for completion (kafka.coordinator.transaction.TransactionMarkerRequestCompletionHandler)
      

      As you can see, the client received the response 9 ms after the emigration was initiated on the server.

      Since it is perfectly acceptable for the LeaderISR metadata to be propagated asynchronously, KafkaApis should handle emigration more robustly, so that it returns the right error code when it receives a request for a partition for which the broker is no longer the leader.
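
      One possible shape for such handling is to treat the None case separately from the old-format case. The sketch below is only an illustration under assumptions, not the actual fix: MarkerCheck, Writable, UnsupportedFormat, NotHostedHere, classify and classifyAll are hypothetical names, and this issue does not state which error code the real fix returns for the not-hosted case.

```scala
// Standalone sketch with stand-in types; not the real Kafka classes or fix.
object EmigrationAwareSketch {
  final case class TopicPartition(topic: String, partition: Int)

  // Hypothetical three-way outcome instead of a boolean.
  sealed trait MarkerCheck
  case object Writable extends MarkerCheck          // marker can be appended
  case object UnsupportedFormat extends MarkerCheck // would map to UNSUPPORTED_FOR_MESSAGE_FORMAT
  case object NotHostedHere extends MarkerCheck     // would map to some retriable error (assumption)

  // Mirrors RecordBatch.MAGIC_VALUE_V2.
  val MagicValueV2: Byte = 2

  // Distinguish "no magic because the partition is gone" from "magic too old".
  def classify(magic: Option[Byte]): MarkerCheck = magic match {
    case Some(m) if m >= MagicValueV2 => Writable
    case Some(_)                      => UnsupportedFormat
    case None                         => NotHostedHere
  }

  // Group the partitions of one marker by outcome; getMagic is passed in
  // so the sketch stays independent of ReplicaManager.
  def classifyAll(
      getMagic: TopicPartition => Option[Byte],
      partitions: Seq[TopicPartition]
  ): Map[MarkerCheck, Seq[TopicPartition]] =
    partitions.groupBy(tp => classify(getMagic(tp)))
}
```

      With a split like this, only partitions that are hosted with an older message format would receive UNSUPPORTED_FOR_MESSAGE_FORMAT, while an emigrated partition could receive an error that lets the sender find the new leader and retry.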

              People

              • Assignee: Apurva Mehta
              • Reporter: Apurva Mehta