Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 1.12.0, 1.13.0, 1.14.0, 1.15.0, 1.16.0, 1.17.0
- None
Description
If a Kudu Java client application keeps a connection to a tablet server open and the tablet server is killed or restarted, or a network error occurs on the connection, the client application might end up in a state where it cannot communicate with the tablet server even after the tablet server is up and running again. If the application then tries to write to any tablet replica hosted by that tablet server, every such request times out on the very first attempt, and the connection to the server remains in limbo from then on. The only way out is to recreate the affected Kudu Java client instance, e.g., by restarting the application.
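As a rough illustration of the "recreate the client instance" workaround without restarting the whole application, the sketch below keeps the client behind a small holder that can swap in a fresh instance. The class RecreatableClient and its methods are hypothetical names invented for this example; nothing like it is part of the Kudu client API. It only assumes that the client type is AutoCloseable (which KuduClient is) and that a factory can build a new one.

```java
import java.util.function.Supplier;

/**
 * Hypothetical helper (not part of Kudu): holds a client instance and
 * replaces it with a freshly built one on demand.
 */
final class RecreatableClient<C extends AutoCloseable> {
    private final Supplier<C> factory;
    private C current;

    RecreatableClient(Supplier<C> factory) {
        this.factory = factory;
        this.current = factory.get();
    }

    /** Returns the client instance currently in use. */
    synchronized C get() {
        return current;
    }

    /** Builds a new client, then closes the old one on a best-effort basis. */
    synchronized C recreate() {
        C old = current;
        current = factory.get();
        try {
            old.close();
        } catch (Exception e) {
            // The old instance is presumed broken; ignore failures while closing it.
        }
        return current;
    }
}
```

With the Kudu Java client, the factory could be something like `() -> new KuduClient.KuduClientBuilder(masterAddresses).build()`, where masterAddresses is a placeholder. When to call recreate() (for example, after repeated first-attempt write timeouts against the same tablet server) is a policy decision this sketch leaves to the application.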
More details are below.
Once the NPE is thrown by Connection.exceptionCaught() while trying to access the null ctx variable of the ChannelHandlerContext type, all subsequent attempts to send a Write RPC to any tablet replica hosted by the tablet server time out on the very first attempt (i.e., there are no retries):
java.lang.RuntimeException: PendingErrors overflowed. Failed to write at least 1000 rows to Kudu; Sample errors:
Timed out: cannot complete before timeout: Batch{operations=1000, tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B, 0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1), Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}
Timed out: cannot complete before timeout: Batch{operations=1000, tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B, 0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1), Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}
Timed out: cannot complete before timeout: Batch{operations=1000, tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B, 0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1), Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}
Timed out: cannot complete before timeout: Batch{operations=1000, tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B, 0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1), Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}
Timed out: cannot complete before timeout: Batch{operations=1000, tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B, 0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1), Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}
The root cause of the problem manifests itself as an NPE in Connection.exceptionCaught() with a stack trace like the one below:
24/04/27 13:07:18 WARN DefaultPromise: An exception was thrown by org.apache.kudu.client.Connection$1.operationComplete()
java.lang.NullPointerException
    at org.apache.kudu.client.Connection.exceptionCaught(Connection.java:434)
    at org.apache.kudu.client.Connection$1.operationComplete(Connection.java:746)
    at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
    at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571)
    at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550)
    at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
    at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
    at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
    at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
    at org.apache.kudu.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321)
    at org.apache.kudu.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337)
    at org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:710)
    at org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
    at org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
    at org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
    at org.apache.kudu.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:995)
    at org.apache.kudu.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
The issue was introduced with KUDU-1438 in changelist 57dda5d48.
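To make the failure mode easier to picture, here is a minimal, self-contained model of this class of bug. It is not the Kudu source and not the actual fix: FakeContext and SketchConnection are hypothetical stand-ins, and the assumption that the connect-failure listener passes a null context into the exception handler is inferred only from the description and stack trace above. The point it illustrates is that when the handler dereferences the null context before cleaning up, the NPE aborts the cleanup, so queued RPCs are never failed or retried and can only run into their own deadline.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for Netty's ChannelHandlerContext; not the real type. */
final class FakeContext {
    boolean closed;
    void close() { closed = true; }
}

/** Hypothetical, heavily simplified model of a client connection. */
final class SketchConnection {
    private final List<String> pending = new ArrayList<>();
    private final List<String> failed = new ArrayList<>();
    private boolean terminated;

    void enqueue(String rpc) { pending.add(rpc); }

    /** Unguarded handler: throws an NPE when ctx is null, so cleanup never runs. */
    void exceptionCaughtUnguarded(FakeContext ctx, Throwable cause) {
        ctx.close();
        cleanup(cause);
    }

    /** Guarded handler: tolerates a null ctx and always runs cleanup. */
    void exceptionCaughtGuarded(FakeContext ctx, Throwable cause) {
        if (ctx != null) {
            ctx.close();
        }
        cleanup(cause);
    }

    /** Marks the connection terminated and fails every queued RPC. */
    private void cleanup(Throwable cause) {
        terminated = true;
        for (String rpc : pending) {
            failed.add(rpc + " failed: " + cause.getMessage());
        }
        pending.clear();
    }

    boolean isTerminated() { return terminated; }
    int pendingCount() { return pending.size(); }
    int failedCount() { return failed.size(); }
}
```

In this model, calling exceptionCaughtUnguarded(null, cause) leaves the connection not terminated with its RPC still pending, which mirrors the "limbo" state described above; calling exceptionCaughtGuarded(null, cause) terminates the connection and hands the RPC back as failed so that a caller could retry it.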
Issue Links
- is related to KUDU-1438 [java client] Upgrade to Netty 4 (Resolved)