Description
Sasl/SslConsumerTest.testSimpleConsumption() fails intermittently with a failure in commitAsync(). The exception stack trace shows:
kafka.api.SaslPlaintextConsumerTest.testSimpleConsumption FAILED
java.lang.AssertionError: expected:<1> but was:<0>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:834)
at org.junit.Assert.assertEquals(Assert.java:645)
at org.junit.Assert.assertEquals(Assert.java:631)
at kafka.api.BaseConsumerTest.awaitCommitCallback(BaseConsumerTest.scala:340)
at kafka.api.BaseConsumerTest.testSimpleConsumption(BaseConsumerTest.scala:85)
I have recreated this with some additional trace. The tests run with a very small metadata expiry interval, triggering metadata updates quite often. If a metadata request immediately following a commitAsync() call creates a new SSL/SASL connection, ConsumerNetworkClient.poll returns to process the connection handshake packets. Since ConsumerNetworkClient.poll discards all unsent packets before returning from poll, this can result in the failure of the commit - the callback is invoked with SendFailedException.
I understand that ConsumerNetworkClient.poll() discards unsent packets rather than buffer them to keep the code simple. And perhaps it is ok to fail commitAsync occasionally since the callback does indicate that the caller should retry. But it feels like an unnecessary limitation that requires error handling in client applications when there are no real failures and makes it much harder to test reliably. As special handling to fix issues like KAFKA-3412, KAFKA-2672 adds more complexity to the code anyway, and because it is much harder to debug failures that affect only SSL/SASL, it may be worth considering improving this behaviour.
I will see if I can submit a PR for the specific issue I was seeing with the impact of handshakes on commitAsync(), but I will be interested in views on improving the logic in ConsumerNetworkClient.