Description
Background
https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in the Java client that skips the backoff period for a retried produce batch when the client already knows of a newer leader for the batch's partition.
What changed
The implementation introduced a regression, noticed on a Trogdor benchmark running with a high partition count (36,000 partitions).
With the regression, the following metrics degraded on the produce side (a sketch for reading these metrics at runtime follows the list):
- record-queue-time-avg: increased from 20ms to 30ms.
- request-latency-avg: increased from 50ms to 100ms.
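For reference, both metrics are exposed through the Java client's public metrics API, so the regression can be observed directly from a running producer. Below is a minimal, hypothetical sketch (the broker address and serializer settings are assumptions, not from this report) that prints the two affected metrics:
{code:java}
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public class ProducerMetricsProbe {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (Map.Entry<MetricName, ? extends Metric> e : producer.metrics().entrySet()) {
                MetricName name = e.getKey();
                // "record-queue-time-avg" and "request-latency-avg" are reported
                // in the producer's metrics registry.
                if (name.name().equals("record-queue-time-avg")
                        || name.name().equals("request-latency-avg")) {
                    System.out.println(name.name() + " = " + e.getValue().metricValue());
                }
            }
        }
    }
}
{code}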
Why it happened
As can be seen from the original PR, RecordAccumulator.partitionReady() and drainBatchesForOneNode() started using the synchronised method Metadata.currentLeader(). This has led to increased synchronisation between the KafkaProducer's application thread, which calls send(), and the background sender thread, which actively sends producer batches to the leaders.
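The following is an illustrative sketch only, not the actual client code: the names SharedMetadata and ContentionDemo are made up to show the contention pattern, where both threads funnel through a single synchronised per-partition lookup, so lock traffic scales with the partition count:
{code:java}
import java.util.concurrent.ThreadLocalRandom;

class SharedMetadata {
    private int version = 0;

    // Stand-in for Metadata.currentLeader(): a synchronised per-partition lookup.
    public synchronized int currentLeader(int partition) {
        return (partition + version) % 3; // placeholder leader computation
    }

    public synchronized void update() {
        version++;
    }
}

public class ContentionDemo {
    public static void main(String[] args) throws InterruptedException {
        SharedMetadata metadata = new SharedMetadata();
        int partitions = 36_000;

        // "Application thread": touches metadata on every send(), as the producer does.
        Thread appThread = new Thread(() -> {
            for (int i = 0; i < 1_000; i++) {
                metadata.currentLeader(ThreadLocalRandom.current().nextInt(partitions));
            }
        });

        // "Sender thread": checks every partition per drain pass, as
        // RecordAccumulator.partitionReady() does in this benchmark.
        Thread senderThread = new Thread(() -> {
            for (int p = 0; p < partitions; p++) {
                metadata.currentLeader(p); // each call competes for the same monitor
            }
        });

        appThread.start();
        senderThread.start();
        appThread.join();
        senderThread.join();
    }
}
{code}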
Lock profiles clearly show increased synchronisation in the KAFKA-15415 PR (highlighted in red) versus the baseline (see below). Note that the synchronisation is much worse for partitionReady() in this benchmark, as it is called for each partition, and the benchmark has 36,000 partitions.
Lock Profile: KAFKA-15415
Lock Profile: Baseline
Fix
Synchronisation between the two threads has to be reduced in order to address this. https://github.com/apache/kafka/pull/15323 is a fix for it: it avoids Metadata.currentLeader() and instead relies on Cluster.leaderFor(), which reads an immutable metadata snapshot.
With the fix, the lock profile and metrics are similar to the baseline.
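Below is a minimal sketch of the approach the fix relies on. It uses real client classes (Cluster, PartitionInfo, Node, TopicPartition) but a hand-built snapshot for illustration; how the fix wires this into the sender path is not shown here. The idea is to fetch an immutable Cluster snapshot once, then perform lock-free per-partition leaderFor() lookups against it:
{code:java}
import java.util.Collections;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.Node;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

public class SnapshotLookupSketch {
    public static void main(String[] args) {
        Node leader = new Node(0, "localhost", 9092);
        PartitionInfo p0 = new PartitionInfo("topic", 0, leader,
                new Node[]{leader}, new Node[]{leader});

        // In the client, a snapshot like this is obtained from the metadata once
        // per drain pass, rather than querying a synchronised method per partition.
        Cluster cluster = new Cluster("cluster-id",
                Collections.singletonList(leader),
                Collections.singletonList(p0),
                Collections.emptySet(),
                Collections.emptySet());

        // Lock-free per-partition lookup on the immutable snapshot.
        Node found = cluster.leaderFor(new TopicPartition("topic", 0));
        System.out.println("leader = " + found);
    }
}
{code}
Because Cluster is immutable, per-partition reads acquire no monitor; only the one-time snapshot fetch synchronises with the application thread, which is why the lock profile returns to baseline.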
Attachments
Issue Links
- is a child of: KAFKA-15868 KIP-951 - Leader discovery optimisations for the client (Closed)
- is a clone of: KAFKA-15415 In Java-client, backoff should be skipped for retried producer-batch to a new leader (Resolved)
- is duplicated by: KAFKA-16259 Immutable MetadataCache to improve client performance (Resolved)
- links to