Details

    • Correctness - Test Failure
    • Normal
    • Normal
    • Unit Test
    • All
    • None

    Attachments

    Issue Links

    Activity

          bereng Berenguer Blasi added a comment - https://ci-cassandra.apache.org/job/Cassandra-4.0/53/testReport/junit/org.apache.cassandra.net/ConnectionTest/testMessageDeliveryOnReconnect_cdc/ #collaborating

          brandon.williams Brandon Williams added a comment -

          Thanks for that. Now I'm pretty sure there are two to three bugs going on here.
          brandon.williams Brandon Williams added a comment - edited

          I'm giving up on this for now, as it can only damage my calm further to continue.

          What I've found is that sometimes you will see a stack that seems to indicate the channel was already closed:

          [junit-timeout] WARN  [Messaging-OUT-/127.0.0.1:7012->/127.0.0.1:7012-LARGE_MESSAGES] 2021-05-26 16:38:31,164 OutboundConnection.java:492 - /127.0.0.1:7012->/127.0.0.1:7012-LARGE_MESSAGES-[no-channel] dropping message of type _TEST_1 due to error
          [junit-timeout] org.apache.cassandra.net.AsyncChannelOutputPlus$FlushException: This output stream is in an unsafe state after an asynchronous flush failed
          [junit-timeout]         at org.apache.cassandra.net.AsyncChannelOutputPlus.propagateFailedFlush(AsyncChannelOutputPlus.java:209)
          [junit-timeout]         at org.apache.cassandra.net.AsyncChannelOutputPlus.waitUntilFlushed(AsyncChannelOutputPlus.java:166)
          [junit-timeout]         at org.apache.cassandra.net.AsyncChannelOutputPlus.flush(AsyncChannelOutputPlus.java:238)
          [junit-timeout]         at org.apache.cassandra.net.AsyncChannelOutputPlus.close(AsyncChannelOutputPlus.java:259)
          [junit-timeout]         at org.apache.cassandra.net.AsyncMessageOutputPlus.close(AsyncMessageOutputPlus.java:117)
          [junit-timeout]         at org.apache.cassandra.net.OutboundConnection$LargeMessageDelivery.doRun(OutboundConnection.java:1002)
          [junit-timeout]         at org.apache.cassandra.net.OutboundConnection$Delivery.run(OutboundConnection.java:691)
          [junit-timeout]         at org.apache.cassandra.net.OutboundConnection$LargeMessageDelivery.run(OutboundConnection.java:960)
          [junit-timeout]         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          [junit-timeout]         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          [junit-timeout]         at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          [junit-timeout]         at java.lang.Thread.run(Thread.java:748)
          

          If you add a check for the channel being open before this, it passes. Other times the test will fail in what appears to be the same manner, but there will be no stack trace at all. Most of the failures occur with LARGE_MESSAGES, but given enough attempts there is occasionally one in SMALL_MESSAGES.

          Then there is the unique stack that Berenguer linked, which I have yet to encounter myself. It is odd too, in that according to it both small and large messages are expecting 30 bytes and only getting 29, but all my debugging shows 29 bytes working just fine for large messages, and 11 for small (where the stack wants 14, receiving 13).
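
          For illustration, a minimal sketch of the kind of guard described above (hypothetical names, not the actual test or patch):

          import io.netty.channel.Channel;

          final class ChannelCloseGuard
          {
              // Hypothetical sketch of "add a check for the channel being open before this":
              // only attempt the final flush/close while the channel is still open, and treat
              // a race to closed as benign rather than letting it surface as a FlushException.
              static void closeIfStillOpen(Channel channel, AutoCloseable out)
              {
                  if (channel == null || !channel.isOpen())
                      return; // channel already torn down; nothing left to flush

                  try
                  {
                      out.close(); // close() performs the final flush
                  }
                  catch (Exception e)
                  {
                      // channel closed between the check and the flush; benign in this context
                  }
              }
          }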

          bereng Berenguer Blasi added a comment - edited

          I am really confused. After the recent sudden protocol/connection flakies I can't run ConnectionTest locally anymore at all without failing with a million such messages. I have gone back as far as 4.0-beta4 and it still fails:

          ant test -Dtest.name=ConnectionTest
          
          [junit-timeout] WARN  [Messaging-OUT-/127.0.0.1:7012->/127.0.0.1:7012-LARGE_MESSAGES] 2021-06-04 10:42:00,776 OutboundConnection.java:488 - /127.0.0.1:7012->/127.0.0.1:7012-LARGE_MESSAGES-8fa2dfba dropping message of type _TEST_1 due to error
          [junit-timeout] org.apache.cassandra.net.InvalidSerializedSizeException: Invalid serialized size; expected 30, actual size at least 29, for verb _TEST_1
          [junit-timeout] 	at org.apache.cassandra.net.OutboundConnection$LargeMessageDelivery.doRun(OutboundConnection.java:984)
          [junit-timeout] 	at org.apache.cassandra.net.OutboundConnection$Delivery.run(OutboundConnection.java:687)
          [junit-timeout] 	at org.apache.cassandra.net.OutboundConnection$LargeMessageDelivery.run(OutboundConnection.java:955)
          [junit-timeout] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          [junit-timeout] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          [junit-timeout] 	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          [junit-timeout] 	at java.lang.Thread.run(Thread.java:748)
          
          bereng Berenguer Blasi added a comment - edited

          Well, repeating the test 1000 times on CI gives a completely different error, not the one above ???

          junit.framework.AssertionFailedError: expected:<REMOVED_NODE /127.0.0.1:9042> but was:<null>
          	at org.apache.cassandra.transport.ClientNotificiationsTest$EventHandler.assertNextEvent(ClientNotificiationsTest.java:109)
          	at org.apache.cassandra.transport.ClientNotificiationsTest.testNotifications(ClientNotificiationsTest.java:95)
          

          e.dimitrova Ekaterina Dimitrova added a comment -

          I missed this one and opened a new one, CASSANDRA-16723, which I am closing as a duplicate of this one.

          I didn't dig into it much, it doesn't fail for me locally.

          My first suspicion (and probably hope) was that the latch2.await(); call needs more time. Increasing the timeout didn't make the test fail any less. A quick way to validate whether it was really just timing was to simply use await(), and then the test started timing out. Thanks brandon.williams for pointing to the original ticket.
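
          For clarity, the difference between the two waits being discussed, as a stand-alone sketch (latch2 here is a stand-in, not the test's actual field):

          import java.util.concurrent.CountDownLatch;
          import java.util.concurrent.TimeUnit;

          final class LatchWaitSketch
          {
              static boolean waitForEvent(CountDownLatch latch2) throws InterruptedException
              {
                  // Timed wait: returns false if the event never arrives within 30 seconds,
                  // so the test can fail with an assertion instead of hanging.
                  boolean arrived = latch2.await(30, TimeUnit.SECONDS);

                  // Untimed wait: blocks until the count reaches zero; if the event is lost,
                  // this never returns and the whole test times out instead.
                  // latch2.await();

                  return arrived;
              }
          }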


          bereng Berenguer Blasi added a comment -

          e.dimitrova I was looking into this one in passive mode. I have a couple of improvements you can pull here. My 2cts.


          e.dimitrova Ekaterina Dimitrova added a comment -

          Thanks, I already have them, as indeed the error message was looking weird. I will try to play a bit more with it today if nothing urgent pops up.


          e.dimitrova Ekaterina Dimitrova added a comment -

          I didn't manage to find anything more interesting, and I am not even able to reproduce all the errors you mentioned.

          I will be off for a big part of the next 10 days, so I unassigned it in case someone wants to give it a try in the meanwhile.


          bereng Berenguer Blasi added a comment -

          The problem comes from trying to flush against an already closed connection. Adding some stacks for future reference:

          [junit-timeout] INFO  [Messaging-EventLoop-3-7] 2021-07-06 12:17:23,858 InboundConnectionInitiator.java:464 - /127.0.0.1:7012(/127.0.0.1:49576)->/127.0.0.1:7012-LARGE_MESSAGES-6cbecdc7 messaging connection established, version = 12, framing = LZ4, encryption = encrypted(factory=openssl;protocol=TLSv1.3;cipher=TLS_AES_128_GCM_SHA256)
          [junit-timeout] INFO  [main] 2021-07-06 12:17:23,865 InboundConnectionInitiator.java:127 - Listening on address: (/127.0.0.1:17012), nic: lo, encryption: encrypted(openssl)
          [junit-timeout] INFO  [main] 2021-07-06 12:17:23,865 InboundConnectionInitiator.java:127 - Listening on address: (/127.0.0.1:7012), nic: lo, encryption: optionally encrypted(openssl)
          [junit-timeout] WARN  [Messaging-OUT-/127.0.0.1:7012->/127.0.0.1:7012-LARGE_MESSAGES] 2021-07-06 12:17:23,868 OutboundConnection.java:488 - /127.0.0.1:7012->/127.0.0.1:7012-LARGE_MESSAGES-[no-channel] dropping message of type _TEST_1 due to error
          [junit-timeout] org.apache.cassandra.net.AsyncChannelOutputPlus$FlushException: This output stream is in an unsafe state after an asynchronous flush failed
          [junit-timeout] 	at org.apache.cassandra.net.AsyncChannelOutputPlus.propagateFailedFlush(AsyncChannelOutputPlus.java:201)
          [junit-timeout] 	at org.apache.cassandra.net.AsyncChannelOutputPlus.waitUntilFlushed(AsyncChannelOutputPlus.java:158)
          [junit-timeout] 	at org.apache.cassandra.net.AsyncChannelOutputPlus.flush(AsyncChannelOutputPlus.java:230)
          [junit-timeout] 	at org.apache.cassandra.net.AsyncChannelOutputPlus.close(AsyncChannelOutputPlus.java:248)
          [junit-timeout] 	at org.apache.cassandra.net.AsyncMessageOutputPlus.close(AsyncMessageOutputPlus.java:110)
          [junit-timeout] 	at org.apache.cassandra.net.OutboundConnection$LargeMessageDelivery.doRun(OutboundConnection.java:986)
          [junit-timeout] 	at org.apache.cassandra.net.OutboundConnection$Delivery.run(OutboundConnection.java:687)
          [junit-timeout] 	at org.apache.cassandra.net.OutboundConnection$LargeMessageDelivery.run(OutboundConnection.java:955)
          [junit-timeout] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          [junit-timeout] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          [junit-timeout] 	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          [junit-timeout] 	at java.lang.Thread.run(Thread.java:748)
          [junit-timeout] Caused by: io.netty.handler.ssl.SslClosedEngineException: SSLEngine closed already
          [junit-timeout] 	at io.netty.handler.ssl.SslHandler.wrap(SslHandler.java:854)
          [junit-timeout] 	at io.netty.handler.ssl.SslHandler.wrapAndFlush(SslHandler.java:811)
          [junit-timeout] 	at io.netty.handler.ssl.SslHandler.flush(SslHandler.java:792)
          [junit-timeout] 	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:750)
          [junit-timeout] 	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:742)
          [junit-timeout] 	at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:728)
          [junit-timeout] 	at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:125)
          [junit-timeout] 	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:750)
          [junit-timeout] 	at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:765)
          [junit-timeout] 	at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
          [junit-timeout] 	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
          [junit-timeout] 	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
          [junit-timeout] 	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
          [junit-timeout] 	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
          [junit-timeout] 	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
          [junit-timeout] 	... 2 common frames omitted
          
          [junit-timeout] org.apache.cassandra.net.AsyncChannelOutputPlus$FlushException: The channel this output stream was writing to has been closed
          [junit-timeout] 	at org.apache.cassandra.net.AsyncChannelOutputPlus.propagateFailedFlush(AsyncChannelOutputPlus.java:200)
          [junit-timeout] 	at org.apache.cassandra.net.AsyncChannelOutputPlus.waitUntilFlushed(AsyncChannelOutputPlus.java:158)
          [junit-timeout] 	at org.apache.cassandra.net.AsyncChannelOutputPlus.flush(AsyncChannelOutputPlus.java:230)
          [junit-timeout] 	at org.apache.cassandra.net.AsyncChannelOutputPlus.close(AsyncChannelOutputPlus.java:248)
          [junit-timeout] 	at org.apache.cassandra.net.AsyncMessageOutputPlus.close(AsyncMessageOutputPlus.java:110)
          [junit-timeout] 	at org.apache.cassandra.net.OutboundConnection$LargeMessageDelivery.doRun(OutboundConnection.java:986)
          [junit-timeout] 	at org.apache.cassandra.net.OutboundConnection$Delivery.run(OutboundConnection.java:687)
          [junit-timeout] 	at org.apache.cassandra.net.OutboundConnection$LargeMessageDelivery.run(OutboundConnection.java:955)
          [junit-timeout] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          [junit-timeout] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          [junit-timeout] 	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          [junit-timeout] 	at java.lang.Thread.run(Thread.java:748)
          [junit-timeout] Caused by: io.netty.channel.StacklessClosedChannelException: null
          [junit-timeout] 	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)
          

          Seems like waiting on AsyncPromise.isDone() on outbound.close fixes it, but I don't know why yet.
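
          A minimal sketch of what that wait could look like, assuming outbound.close(...) hands back a Netty-style future/promise (names and signatures are illustrative, not the actual fix):

          import java.util.concurrent.TimeUnit;
          import io.netty.util.concurrent.Future;

          final class CloseAwaitSketch
          {
              // Wait for the close promise to complete (isDone()) before the test proceeds
              // to reconnect, instead of racing against the connection teardown.
              static void awaitClosed(Future<?> closePromise) throws InterruptedException
              {
                  if (!closePromise.await(10, TimeUnit.SECONDS))
                      throw new AssertionError("outbound connection did not finish closing in time");
                  assert closePromise.isDone();
              }
          }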


          e.dimitrova Ekaterina Dimitrova added a comment -

          Yes, I think this is one of the things Brandon also found. But in my logs this warning doesn't appear when the test fails.

          Regarding outbound.close, I saw it is invoked with flushQueue=false.

          Did you mean the test is not failing anymore at all if you use AsyncPromise.isDone()?

          Also, I wanted to confirm: did you trigger the warning in a specific way, or does the test just fail like that for you?


          bereng Berenguer Blasi added a comment -

          There are a few test cases that can fail in this test. I don't know if they all have a common root cause. But so far, after diving through all the async, SSL and some netty code, I can't yet pin down who is closing that connection. There are so many asyncs, threads, loose optimistic concurrency assumptions, etc... I don't see anything obvious in the netty code either, but I don't know it thoroughly.

          I had been focusing on the outbound closing logic, but now I am focusing on the inbound side of things, where a well-placed Thread.sleep() seems to fix it (still to be confirmed). I am afraid I have only 3 options:

          • Download the netty sources, add debug logging and continue the hunt. Worth it?
          • Add a Thread.sleep()
          • Mark as flaky

          I need to think a bit more about this...
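
          For reference, the fixed sleep in option 2 could also be written as a bounded wait on a condition, roughly like this (purely illustrative; the condition itself would depend on what the inbound side exposes):

          import java.util.function.BooleanSupplier;

          final class BoundedWaitSketch
          {
              // Poll a condition with a deadline instead of sleeping a fixed amount, so the
              // test waits only as long as needed and still fails fast on timeout.
              static void waitUntil(BooleanSupplier condition, long timeoutMillis) throws InterruptedException
              {
                  long deadline = System.currentTimeMillis() + timeoutMillis;
                  while (!condition.getAsBoolean())
                  {
                      if (System.currentTimeMillis() > deadline)
                          throw new AssertionError("condition not met within " + timeoutMillis + "ms");
                      Thread.sleep(10);
                  }
              }
          }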


          bereng Berenguer Blasi added a comment -

          This won't be ready for the 4.0.0 GA cut and we're hitting a point of diminishing returns. Parking it until there's a gap where dealing with netty code makes sense again, or I regain the appetite.

          jmckenzie Josh McKenzie added a comment -

          Rather than dumping a stack, it might be worth catching the exceptions we know are benign and "expected" (i.e. accessing the Channel after close) and printing a slightly more useful error message, perhaps even pointing to this ticket.

          Not much precedent for that in the codebase as we obviously prefer to handle and fix things properly, but leading up to a .0 GA the extra hygiene might be worth it. Just a passing thought.
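
          As a sketch of that idea (not actual project code; the exact exception types and message would need to match what the connection code really throws):

          import java.nio.channels.ClosedChannelException;
          import org.slf4j.Logger;
          import org.slf4j.LoggerFactory;

          final class BenignFlushFailureSketch
          {
              private static final Logger logger = LoggerFactory.getLogger(BenignFlushFailureSketch.class);

              // Treat a known-benign "channel already closed" flush failure as expected during
              // shutdown/reconnect and log a short pointer instead of a full stack trace.
              static void onFlushFailure(Throwable cause)
              {
                  if (cause instanceof ClosedChannelException)
                      logger.warn("Flush raced with channel close (expected during reconnect; see this ticket): {}", cause.toString());
                  else
                      logger.error("Unexpected flush failure", cause);
              }
          }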

          bereng Berenguer Blasi added a comment - edited

          A similar alternative would be to mark it flaky, which I am very tempted to do. The problem I have with these solutions is that I can't be sure those exceptions are benign. So painting a GA green by cosmetics rather than addressing the problem is something I hard -1 to myself lol... Why do we access a channel after a close? Who closes it? etc.

          I just need to take a break from this and, when the next gap comes along, get netty's code, put extra logging all around and see who is closing that connection. My 2cts.

          But thanks for the suggestion. All help welcome.

          jmckenzie Josh McKenzie added a comment -

          I believe the last run of the numbers showed something like 13% of flaky tests being actual product defects. I'm personally not very comfortable marking things as flaky unless we have a strong understanding of why the tests are flaking out, and at that point we probably know more than enough to fix them anyway.

          azotcsit Aleksei Zotov added a comment -

          I can see it failed again for builds 935 and 937. I linked this ticket to the failure in Butler. Attaching the full log for build 937 (in case someone wants to check it in the future):

          TEST-org.apache.cassandra.net.ConnectionTest.log.xz

           

          azotcsit Aleksei Zotov added a comment -

          Also, I can see that https://ci-cassandra.apache.org/job/Cassandra-trunk/936/testReport/org.apache.cassandra.net/ConnectionTest/testTimeout/ failed once for trunk. However, the error message is pretty much the same, so I'm linking it to this ticket as well.

          adelapena Andres de la Peña added a comment -

          More failures on https://ci-cassandra.apache.org/job/Cassandra-trunk/1032/testReport/org.apache.cassandra.transport/CQLConnectionTest. They can be reproduced with the multiplexer: https://app.circleci.com/pipelines/github/adelapena/cassandra/1405/workflows/88fb8e4f-7a3f-42f9-a6db-5d14a652b54e
          jlewandowski Jacek Lewandowski added a comment - https://app.circleci.com/pipelines/github/jacek-lewandowski/cassandra/909/workflows/8c169ac2-a284-426b-9a64-510e966e63d5/jobs/32408

          People

            Assignee: Unassigned
            Reporter: brandon.williams Brandon Williams
            Votes: 0
            Watchers: 8
