Details
-
Bug
-
Status: Resolved
-
Minor
-
Resolution: Fixed
-
1.18.0, 1.16.3, 1.17.2
-
Fixes a bug in the RestClient where making a request after the client was closed returns a future that never completes.
Description
The RestClient can deadlock if a request is made after the Netty event executor has terminated.
This is due to the listener that would resolve the CompletableFuture that is attached to the ChannelFuture returned by the call to Netty to connect not being able to run because the executor to run it rejects the execution.
final ChannelFuture connectFuture = bootstrap.connect(targetAddress, targetPort); final CompletableFuture<Channel> channelFuture = new CompletableFuture<>(); connectFuture.addListener( (ChannelFuture future) -> { if (future.isSuccess()) { channelFuture.complete(future.channel()); } else { channelFuture.completeExceptionally(future.cause()); } });
In this code, the call to addListener() can fail silently (only logging to the console), meaning any code waiting on the CompletableFuture returned by this method will deadlock.
There was some work in Netty around this back in 2015, but it's unclear to me how this situation is expected to be handled given the discussion and changes from these issues:
- https://github.com/netty/netty/issues/3449 (open)
- https://github.com/netty/netty/pull/3483 (closed)
- https://github.com/netty/netty/pull/3566 (closed)
- https://github.com/netty/netty/pull/5087 (merged)
I think a reasonable fix for Flink would be to check the state of connectFuture and channelFuture immediately after the call to addListener(), resolving channelFuture with completeExceptionally() if connectFuture is done and failed and channelFuture has not been completed. In the possible race condition where the listener was attached successfully and the connection fails instantly, the result is the same, as calls to CompletableFuture#completeExceptionally() are idempotent.
A workaround for users of RestClient is to call CompletableFuture#get(long timeout, TimeUnit unit) rather than #get() or #join() on the CompletableFutures it returns. However, if the call throws TimeoutException, the cause of the failure cannot easily be determined.