Details

    • Correctness - Test Failure
    • Normal
    • Normal
    • Unit Test
    • All
    • None

    Attachments

    Issue Links

    Activity

          bereng Berenguer Blasi added a comment - https://ci-cassandra.apache.org/job/Cassandra-4.0/53/testReport/junit/org.apache.cassandra.net/ConnectionTest/testMessageDeliveryOnReconnect_cdc/ #collaborating

          brandon.williams Brandon Williams added a comment -

          Thanks for that. Now I'm pretty sure there are two to three bugs going on here.
          brandon.williams Brandon Williams added a comment - edited

          I'm giving up on this for now, as it can only damage my calm further to continue.

          What I've found is that sometimes you will see a stack that seems to indicate the channel was already closed:

          [junit-timeout] WARN  [Messaging-OUT-/127.0.0.1:7012->/127.0.0.1:7012-LARGE_MESSAGES] 2021-05-26 16:38:31,164 OutboundConnection.java:492 - /127.0.0.1:7012->/127.0.0.1:7012-LARGE_MESSAGES-[no-channel] dropping message of type _TEST_1 due to error
          [junit-timeout] org.apache.cassandra.net.AsyncChannelOutputPlus$FlushException: This output stream is in an unsafe state after an asynchronous flush failed
          [junit-timeout]         at org.apache.cassandra.net.AsyncChannelOutputPlus.propagateFailedFlush(AsyncChannelOutputPlus.java:209)
          [junit-timeout]         at org.apache.cassandra.net.AsyncChannelOutputPlus.waitUntilFlushed(AsyncChannelOutputPlus.java:166)
          [junit-timeout]         at org.apache.cassandra.net.AsyncChannelOutputPlus.flush(AsyncChannelOutputPlus.java:238)
          [junit-timeout]         at org.apache.cassandra.net.AsyncChannelOutputPlus.close(AsyncChannelOutputPlus.java:259)
          [junit-timeout]         at org.apache.cassandra.net.AsyncMessageOutputPlus.close(AsyncMessageOutputPlus.java:117)
          [junit-timeout]         at org.apache.cassandra.net.OutboundConnection$LargeMessageDelivery.doRun(OutboundConnection.java:1002)
          [junit-timeout]         at org.apache.cassandra.net.OutboundConnection$Delivery.run(OutboundConnection.java:691)
          [junit-timeout]         at org.apache.cassandra.net.OutboundConnection$LargeMessageDelivery.run(OutboundConnection.java:960)
          [junit-timeout]         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          [junit-timeout]         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          [junit-timeout]         at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          [junit-timeout]         at java.lang.Thread.run(Thread.java:748)
          

          If you add a check for the channel being open before this, it passes. Other times the test will fail in what appears to be the same manner, but there will be no stack trace at all. Most of the failures occur with LARGE_MESSAGES, but given enough attempts there is occasionally one in SMALL_MESSAGES.

          Then there is the unique stack that Berenguer linked, which I have yet to encounter myself. It is odd too, in that according to it both small and large messages are expecting 30 bytes and only getting 29, but all my debugging shows 29 bytes working just fine for large messages, and 11 for small (where the stack wants 14, receiving 13).
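
          For illustration, a minimal sketch of the kind of guard described above (hypothetical names, not the actual test or patch):

          import io.netty.channel.Channel;

          final class ChannelCloseGuard
          {
              // Hypothetical sketch of "add a check for the channel being open before this":
              // only attempt the final flush/close while the channel is still open, and treat
              // a race to closed as benign rather than letting it surface as a FlushException.
              static void closeIfStillOpen(Channel channel, AutoCloseable out)
              {
                  if (channel == null || !channel.isOpen())
                      return; // channel already torn down; nothing left to flush

                  try
                  {
                      out.close(); // close() performs the final flush
                  }
                  catch (Exception e)
                  {
                      // channel closed between the check and the flush; benign in this context
                  }
              }
          }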

          bereng Berenguer Blasi added a comment - edited

          I am really confused. After the recent sudden protocol/connection flakies I can't run ConnectionTest locally anymore at all without failing with a million such messages. I have gone back as far as 4.0-beta4 and it still fails:

          ant test -Dtest.name=ConnectionTest
          
          [junit-timeout] WARN  [Messaging-OUT-/127.0.0.1:7012->/127.0.0.1:7012-LARGE_MESSAGES] 2021-06-04 10:42:00,776 OutboundConnection.java:488 - /127.0.0.1:7012->/127.0.0.1:7012-LARGE_MESSAGES-8fa2dfba dropping message of type _TEST_1 due to error
          [junit-timeout] org.apache.cassandra.net.InvalidSerializedSizeException: Invalid serialized size; expected 30, actual size at least 29, for verb _TEST_1
          [junit-timeout] 	at org.apache.cassandra.net.OutboundConnection$LargeMessageDelivery.doRun(OutboundConnection.java:984)
          [junit-timeout] 	at org.apache.cassandra.net.OutboundConnection$Delivery.run(OutboundConnection.java:687)
          [junit-timeout] 	at org.apache.cassandra.net.OutboundConnection$LargeMessageDelivery.run(OutboundConnection.java:955)
          [junit-timeout] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          [junit-timeout] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          [junit-timeout] 	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          [junit-timeout] 	at java.lang.Thread.run(Thread.java:748)
          
          bereng Berenguer Blasi added a comment - edited

          Well, repeating the test 1000 times on CI gives a completely different error, not the one above ???

          junit.framework.AssertionFailedError: expected:<REMOVED_NODE /127.0.0.1:9042> but was:<null>
          	at org.apache.cassandra.transport.ClientNotificiationsTest$EventHandler.assertNextEvent(ClientNotificiationsTest.java:109)
          	at org.apache.cassandra.transport.ClientNotificiationsTest.testNotifications(ClientNotificiationsTest.java:95)
          

          e.dimitrova Ekaterina Dimitrova added a comment -

          I missed this one and opened a new one, CASSANDRA-16723, which I am closing as a duplicate of this one.

          I didn't dig into it much, it doesn't fail for me locally.

          My first suspicion (and probably hope) was that the latch2.await(); call needs more time. Increasing the timeout didn't make the test fail any less. A quick way to validate whether it was really just timing was to simply use await(), and then the test started timing out. Thanks brandon.williams for pointing to the original ticket.
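
          For clarity, the difference between the two waits being discussed, as a stand-alone sketch (latch2 here is a stand-in, not the test's actual field):

          import java.util.concurrent.CountDownLatch;
          import java.util.concurrent.TimeUnit;

          final class LatchWaitSketch
          {
              static boolean waitForEvent(CountDownLatch latch2) throws InterruptedException
              {
                  // Timed wait: returns false if the event never arrives within 30 seconds,
                  // so the test can fail with an assertion instead of hanging.
                  boolean arrived = latch2.await(30, TimeUnit.SECONDS);

                  // Untimed wait: blocks until the count reaches zero; if the event is lost,
                  // this never returns and the whole test times out instead.
                  // latch2.await();

                  return arrived;
              }
          }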


          bereng Berenguer Blasi added a comment -

          e.dimitrova I was looking into this one in passive mode. I have a couple of improvements you can pull here. My 2cts.


          e.dimitrova Ekaterina Dimitrova added a comment -

          Thanks, I already have them, as indeed the error message was looking weird. I will try to play a bit more with it today if nothing urgent pops up.


          e.dimitrova Ekaterina Dimitrova added a comment -

          I didn't manage to find anything more interesting, and I am not even able to reproduce all the errors you mentioned.

          I will be off for a big part of the next 10 days, so I unassigned it in case someone wants to give it a try in the meanwhile.


          bereng Berenguer Blasi added a comment -

          The problem comes from trying to flush against an already closed connection. Adding some stacks for future reference:

          [junit-timeout] INFO  [Messaging-EventLoop-3-7] 2021-07-06 12:17:23,858 InboundConnectionInitiator.java:464 - /127.0.0.1:7012(/127.0.0.1:49576)->/127.0.0.1:7012-LARGE_MESSAGES-6cbecdc7 messaging connection established, version = 12, framing = LZ4, encryption = encrypted(factory=openssl;protocol=TLSv1.3;cipher=TLS_AES_128_GCM_SHA256)
          [junit-timeout] INFO  [main] 2021-07-06 12:17:23,865 InboundConnectionInitiator.java:127 - Listening on address: (/127.0.0.1:17012), nic: lo, encryption: encrypted(openssl)
          [junit-timeout] INFO  [main] 2021-07-06 12:17:23,865 InboundConnectionInitiator.java:127 - Listening on address: (/127.0.0.1:7012), nic: lo, encryption: optionally encrypted(openssl)
          [junit-timeout] WARN  [Messaging-OUT-/127.0.0.1:7012->/127.0.0.1:7012-LARGE_MESSAGES] 2021-07-06 12:17:23,868 OutboundConnection.java:488 - /127.0.0.1:7012->/127.0.0.1:7012-LARGE_MESSAGES-[no-channel] dropping message of type _TEST_1 due to error
          [junit-timeout] org.apache.cassandra.net.AsyncChannelOutputPlus$FlushException: This output stream is in an unsafe state after an asynchronous flush failed
          [junit-timeout] 	at org.apache.cassandra.net.AsyncChannelOutputPlus.propagateFailedFlush(AsyncChannelOutputPlus.java:201)
          [junit-timeout] 	at org.apache.cassandra.net.AsyncChannelOutputPlus.waitUntilFlushed(AsyncChannelOutputPlus.java:158)
          [junit-timeout] 	at org.apache.cassandra.net.AsyncChannelOutputPlus.flush(AsyncChannelOutputPlus.java:230)
          [junit-timeout] 	at org.apache.cassandra.net.AsyncChannelOutputPlus.close(AsyncChannelOutputPlus.java:248)
          [junit-timeout] 	at org.apache.cassandra.net.AsyncMessageOutputPlus.close(AsyncMessageOutputPlus.java:110)
          [junit-timeout] 	at org.apache.cassandra.net.OutboundConnection$LargeMessageDelivery.doRun(OutboundConnection.java:986)
          [junit-timeout] 	at org.apache.cassandra.net.OutboundConnection$Delivery.run(OutboundConnection.java:687)
          [junit-timeout] 	at org.apache.cassandra.net.OutboundConnection$LargeMessageDelivery.run(OutboundConnection.java:955)
          [junit-timeout] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          [junit-timeout] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          [junit-timeout] 	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          [junit-timeout] 	at java.lang.Thread.run(Thread.java:748)
          [junit-timeout] Caused by: io.netty.handler.ssl.SslClosedEngineException: SSLEngine closed already
          [junit-timeout] 	at io.netty.handler.ssl.SslHandler.wrap(SslHandler.java:854)
          [junit-timeout] 	at io.netty.handler.ssl.SslHandler.wrapAndFlush(SslHandler.java:811)
          [junit-timeout] 	at io.netty.handler.ssl.SslHandler.flush(SslHandler.java:792)
          [junit-timeout] 	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:750)
          [junit-timeout] 	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:742)
          [junit-timeout] 	at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:728)
          [junit-timeout] 	at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:125)
          [junit-timeout] 	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:750)
          [junit-timeout] 	at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:765)
          [junit-timeout] 	at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
          [junit-timeout] 	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
          [junit-timeout] 	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
          [junit-timeout] 	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
          [junit-timeout] 	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
          [junit-timeout] 	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
          [junit-timeout] 	... 2 common frames omitted
          
          [junit-timeout] org.apache.cassandra.net.AsyncChannelOutputPlus$FlushException: The channel this output stream was writing to has been closed
          [junit-timeout] 	at org.apache.cassandra.net.AsyncChannelOutputPlus.propagateFailedFlush(AsyncChannelOutputPlus.java:200)
          [junit-timeout] 	at org.apache.cassandra.net.AsyncChannelOutputPlus.waitUntilFlushed(AsyncChannelOutputPlus.java:158)
          [junit-timeout] 	at org.apache.cassandra.net.AsyncChannelOutputPlus.flush(AsyncChannelOutputPlus.java:230)
          [junit-timeout] 	at org.apache.cassandra.net.AsyncChannelOutputPlus.close(AsyncChannelOutputPlus.java:248)
          [junit-timeout] 	at org.apache.cassandra.net.AsyncMessageOutputPlus.close(AsyncMessageOutputPlus.java:110)
          [junit-timeout] 	at org.apache.cassandra.net.OutboundConnection$LargeMessageDelivery.doRun(OutboundConnection.java:986)
          [junit-timeout] 	at org.apache.cassandra.net.OutboundConnection$Delivery.run(OutboundConnection.java:687)
          [junit-timeout] 	at org.apache.cassandra.net.OutboundConnection$LargeMessageDelivery.run(OutboundConnection.java:955)
          [junit-timeout] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          [junit-timeout] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          [junit-timeout] 	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          [junit-timeout] 	at java.lang.Thread.run(Thread.java:748)
          [junit-timeout] Caused by: io.netty.channel.StacklessClosedChannelException: null
          [junit-timeout] 	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)
          

          Seems like waiting on AsyncPromise.isDone() on outbound.close fixes it, but I don't know why yet.
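
          A minimal sketch of what that wait could look like, assuming outbound.close(...) hands back a Netty-style future/promise (names and signatures are illustrative, not the actual fix):

          import java.util.concurrent.TimeUnit;
          import io.netty.util.concurrent.Future;

          final class CloseAwaitSketch
          {
              // Wait for the close promise to complete (isDone()) before the test proceeds
              // to reconnect, instead of racing against the connection teardown.
              static void awaitClosed(Future<?> closePromise) throws InterruptedException
              {
                  if (!closePromise.await(10, TimeUnit.SECONDS))
                      throw new AssertionError("outbound connection did not finish closing in time");
                  assert closePromise.isDone();
              }
          }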


          e.dimitrova Ekaterina Dimitrova added a comment -

          Yes, I think this is one of the things Brandon also found. But in my logs this warning doesn't appear when the test fails.

          Regarding outbound.close, I saw it is invoked with flushQueue=false.

          Did you mean the test is not failing anymore at all if you use AsyncPromise.isDone()?

          Also, I wanted to confirm: did you trigger the warning in a specific way, or does the test just fail like that for you?


          bereng Berenguer Blasi added a comment -

          There are a few test cases that can fail in this test. I don't know if they all have a common root cause. But so far, after diving through all the async, SSL and some netty code, I can't yet pin down who is closing that connection. There are so many asyncs, threads, loose optimistic concurrency assumptions, etc... I don't see anything obvious in the netty code either, but I don't know it thoroughly.

          I had been focusing on the outbound closing logic, but now I am focusing on the inbound side of things, where a well-placed Thread.sleep() seems to fix it (still to be confirmed). I am afraid I have only 3 options:

          • Download the netty sources, add debug logging and continue the hunt. Worth it?
          • Add a Thread.sleep()
          • Mark as flaky

          I need to think a bit more about this...
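
          For reference, the fixed sleep in option 2 could also be written as a bounded wait on a condition, roughly like this (purely illustrative; the condition itself would depend on what the inbound side exposes):

          import java.util.function.BooleanSupplier;

          final class BoundedWaitSketch
          {
              // Poll a condition with a deadline instead of sleeping a fixed amount, so the
              // test waits only as long as needed and still fails fast on timeout.
              static void waitUntil(BooleanSupplier condition, long timeoutMillis) throws InterruptedException
              {
                  long deadline = System.currentTimeMillis() + timeoutMillis;
                  while (!condition.getAsBoolean())
                  {
                      if (System.currentTimeMillis() > deadline)
                          throw new AssertionError("condition not met within " + timeoutMillis + "ms");
                      Thread.sleep(10);
                  }
              }
          }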


          bereng Berenguer Blasi added a comment -

          This won't be ready for the 4.0.0 GA cut and we're hitting a point of diminishing returns. Parking it until there's a gap where dealing with netty code makes sense again, or I regain the appetite.

          jmckenzie Josh McKenzie added a comment -

          Rather than dumping a stack, it might be worth catching the exceptions we know are benign and "expected" (i.e. accessing the Channel after close) and printing a slightly more useful error message, perhaps even pointing to this ticket.

          Not much precedent for that in the codebase as we obviously prefer to handle and fix things properly, but leading up to a .0 GA the extra hygiene might be worth it. Just a passing thought.
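
          As a sketch of that idea (not actual project code; the exact exception types and message would need to match what the connection code really throws):

          import java.nio.channels.ClosedChannelException;
          import org.slf4j.Logger;
          import org.slf4j.LoggerFactory;

          final class BenignFlushFailureSketch
          {
              private static final Logger logger = LoggerFactory.getLogger(BenignFlushFailureSketch.class);

              // Treat a known-benign "channel already closed" flush failure as expected during
              // shutdown/reconnect and log a short pointer instead of a full stack trace.
              static void onFlushFailure(Throwable cause)
              {
                  if (cause instanceof ClosedChannelException)
                      logger.warn("Flush raced with channel close (expected during reconnect; see this ticket): {}", cause.toString());
                  else
                      logger.error("Unexpected flush failure", cause);
              }
          }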

          bereng Berenguer Blasi added a comment - edited

          A similar alternative would be to mark it flaky, which I am very tempted to do. The problem I have with these solutions is that I can't be sure those exceptions are benign. So painting a GA green by cosmetics rather than addressing the problem is something I hard -1 to myself lol... Why do we access a channel after a close? Who closes it? etc.

          I just need to take a break from this and, when the next gap comes along, get netty's code, put extra logging all around and see who is closing that connection. My 2cts.

          But thanks for the suggestion. All help welcome.

          jmckenzie Josh McKenzie added a comment -

          I believe the last run of the numbers showed something like 13% of flaky tests being actual product defects. I'm personally not very comfortable marking things as flaky unless we have a strong understanding of why the tests are flaking out, and at that point we probably know more than enough to fix them anyway.

          azotcsit Aleksei Zotov added a comment -

          I can see it failed again for builds 935 and 937. I linked this ticket to the failure in Butler. Attaching the full log for build 937 (in case someone wants to check it in the future):

          TEST-org.apache.cassandra.net.ConnectionTest.log.xz

           

          azotcsit Aleksei Zotov added a comment -

          Also, I can see that https://ci-cassandra.apache.org/job/Cassandra-trunk/936/testReport/org.apache.cassandra.net/ConnectionTest/testTimeout/ failed once for trunk. However, the error message is pretty much the same, so I'm linking it to this ticket as well.

          adelapena Andres de la Peña added a comment -

          More failures on https://ci-cassandra.apache.org/job/Cassandra-trunk/1032/testReport/org.apache.cassandra.transport/CQLConnectionTest. They can be reproduced with the multiplexer: https://app.circleci.com/pipelines/github/adelapena/cassandra/1405/workflows/88fb8e4f-7a3f-42f9-a6db-5d14a652b54e
          jlewandowski Jacek Lewandowski added a comment - https://app.circleci.com/pipelines/github/jacek-lewandowski/cassandra/909/workflows/8c169ac2-a284-426b-9a64-510e966e63d5/jobs/32408

          People

            Assignee: Unassigned
            Reporter: brandon.williams Brandon Williams
            Votes: 0
            Watchers: 8
