Details
- Bug
- Status: Closed
- Critical
- Resolution: Fixed
- 1.13.0, 1.14.0
Description
1) Observations
a) The Azure pipeline would occasionally hang without printing any test error information.
b) When the test KafkaSourceLegacyITCase::testBrokerFailure() is run with INFO-level logging, it hangs and prints the following error message repeatedly:
20451 [New I/O boss #50] ERROR org.apache.flink.networking.NetworkFailureHandler [] - Closing communication channel because of an exception
java.net.ConnectException: Connection refused: localhost/127.0.0.1:50073
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_151]
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:1.8.0_151]
    at org.apache.flink.shaded.testutils.org.jboss.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:152) ~[flink-test-utils_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.shaded.testutils.org.jboss.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:105) [flink-test-utils_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.shaded.testutils.org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:79) [flink-test-utils_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.shaded.testutils.org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337) [flink-test-utils_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.shaded.testutils.org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42) [flink-test-utils_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.shaded.testutils.org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [flink-test-utils_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.shaded.testutils.org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) [flink-test-utils_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
2) Root cause explanations
The test would hang because it enters the following loop:
- closeOnFlush() is called for a given channel
- closeOnFlush() calls channel.write(..)
- channel.write() triggers the exceptionCaught(...) callback
- exceptionCaught(...) calls closeOnFlush() for the same channel again.
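The loop above can be reproduced in a toy model. The following is only a sketch: it uses neither Netty nor the real NetworkFailureHandler. FakeChannel, the event queue, and runEvents(...) are hypothetical stand-ins introduced for illustration, and the model assumes that every write to the broken channel fails and schedules the exception callback.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Toy model of the close/write/exception loop. Not Netty and not Flink code. */
class CloseLoopDemo {

    /** Hypothetical stand-in for a channel whose connection was refused. */
    static final class FakeChannel {
        int writeAttempts = 0;
    }

    // Pending callbacks, standing in for the I/O thread's event queue.
    private final Queue<Runnable> events = new ArrayDeque<>();

    /** Unguarded variant: always writes to the channel before closing it. */
    void closeOnFlush(FakeChannel channel) {
        write(channel);
    }

    /** Assumption: a write to the broken channel fails and schedules the callback. */
    void write(FakeChannel channel) {
        channel.writeAttempts++;
        events.add(() -> exceptionCaught(channel));
    }

    /** The handler reacts to the failure by closing the same channel again. */
    void exceptionCaught(FakeChannel channel) {
        closeOnFlush(channel);
    }

    /** Runs at most maxEvents callbacks; returns true if callbacks are still pending. */
    boolean runEvents(int maxEvents) {
        for (int i = 0; i < maxEvents && !events.isEmpty(); i++) {
            events.poll().run();
        }
        return !events.isEmpty();
    }

    public static void main(String[] args) {
        CloseLoopDemo demo = new CloseLoopDemo();
        FakeChannel channel = new FakeChannel();
        demo.closeOnFlush(channel);
        // The event budget is capped here only so that the demo terminates.
        boolean stillPending = demo.runEvents(1000);
        System.out.println("write attempts: " + channel.writeAttempts
                + ", still pending: " + stillPending);
    }
}
```

In this model each processed callback enqueues exactly one new callback, so the queue never drains no matter how large the event budget is. This mirrors the repeated error message in the log.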
3) Solution
Update closeOnFlush() so that, once it has started closing a channel, a subsequent call on the same channel does not write to that channel again.
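One way to picture this guard is the toy model below. It is a sketch under stated assumptions, not the actual patch: FakeChannel, the event queue, runEvents(...), and the `closing` set are hypothetical names, and the model assumes that a write to the broken channel always fails and triggers the exception callback.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Queue;
import java.util.Set;

/** Toy model of a re-entrancy guard in closeOnFlush(). Not Netty and not Flink code. */
class GuardedCloseDemo {

    /** Hypothetical stand-in for a channel whose connection was refused. */
    static final class FakeChannel {
        int writeAttempts = 0;
    }

    // Pending callbacks, standing in for the I/O thread's event queue.
    private final Queue<Runnable> events = new ArrayDeque<>();

    // Channels that closeOnFlush() has already started closing (identity-based).
    private final Set<FakeChannel> closing =
            Collections.newSetFromMap(new IdentityHashMap<FakeChannel, Boolean>());

    /** Guarded variant: writes only on the first call for a given channel. */
    void closeOnFlush(FakeChannel channel) {
        // Set.add returns false when the channel is already being closed.
        if (!closing.add(channel)) {
            return;
        }
        write(channel);
    }

    /** Assumption: a write to the broken channel fails and schedules the callback. */
    void write(FakeChannel channel) {
        channel.writeAttempts++;
        events.add(() -> exceptionCaught(channel));
    }

    /** The handler still reacts to the failure by calling closeOnFlush() again. */
    void exceptionCaught(FakeChannel channel) {
        closeOnFlush(channel);
    }

    /** Runs at most maxEvents callbacks; returns true if callbacks are still pending. */
    boolean runEvents(int maxEvents) {
        for (int i = 0; i < maxEvents && !events.isEmpty(); i++) {
            events.poll().run();
        }
        return !events.isEmpty();
    }

    public static void main(String[] args) {
        GuardedCloseDemo demo = new GuardedCloseDemo();
        FakeChannel channel = new FakeChannel();
        demo.closeOnFlush(channel);
        boolean stillPending = demo.runEvents(1000);
        System.out.println("write attempts: " + channel.writeAttempts
                + ", still pending: " + stillPending);
    }
}
```

With the guard, the second closeOnFlush() call returns without writing, so no further callback is scheduled and the queue drains after a single write attempt. A real handler running on several I/O threads would likely need a thread-safe set; the single-threaded set here keeps the sketch short.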
Issue Links
- is caused by
  - FLINK-23223 When flushAlways is enabled the subpartition may lose notification of data availability (Closed)
- relates to
  - FLINK-22457 KafkaSourceLegacyITCase.testMultipleSourcesOnePartition fails because of timeout (Closed)
  - FLINK-22416 UpsertKafkaTableITCase hangs when collecting results (Closed)
  - FLINK-22520 KafkaSourceLegacyITCase.testMultipleSourcesOnePartition hangs (Closed)
- links to