Kudu / KUDU-3576

An NPE thrown in Connection.exceptionCaught() makes the connection to the corresponding tablet server unusable


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.12.0, 1.13.0, 1.14.0, 1.15.0, 1.16.0, 1.17.0
    • Fix Version/s: 1.18.0, 1.17.1
    • Component/s: client, java
    • Labels: None

    Description

      If a Kudu Java client application keeps a connection to a tablet server open and the tablet server is killed or restarted, or a network error occurs on the connection, the client application might end up in a state where it cannot communicate with the tablet server even after the tablet server is up and running again. If the application then tries to write to any tablet replica hosted on that tablet server, every such request times out on its very first attempt, and the connection to the server stays in limbo from then on. The only way out is to recreate the affected Kudu Java client instance, e.g., by restarting the application.
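
For applications that cannot afford a full restart, recreating the client instance in-process is one possible form of the workaround. The sketch below is only an illustration under stated assumptions: RecreatableClient is a hypothetical helper (not part of the Kudu API), and deciding when to call recreate() is left to the application.

```java
import java.util.function.Supplier;

/**
 * Hypothetical helper (not part of Kudu): holds a client instance and can
 * replace it with a freshly built one, closing the old instance afterwards.
 */
class RecreatableClient<C extends AutoCloseable> {
    private final Supplier<C> factory;
    private C current;

    RecreatableClient(Supplier<C> factory) {
        this.factory = factory;
        this.current = factory.get();
    }

    /** Returns the client currently in use. */
    synchronized C get() {
        return current;
    }

    /** Builds a new client first, then closes the old one on a best-effort basis. */
    synchronized C recreate() {
        C old = current;
        current = factory.get();
        try {
            old.close();
        } catch (Exception e) {
            // The old client is being discarded anyway, so a failed close is ignored here.
        }
        return current;
    }
}
```

With the Kudu Java client, the factory could be something like () -> new KuduClient.KuduClientBuilder(masterAddresses).build(), where masterAddresses is a placeholder for the application's own configuration. Callers should fetch the client through get() on each use instead of caching it, otherwise they keep talking to the stale instance.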

      More details are below.

      Once the NPE is thrown by Connection.exceptionCaught() upon an attempt to access the null ctx variable of the ChannelHandlerContext type, all subsequent attempts to send a Write RPC to any tablet replica hosted on the tablet server time out on the very first attempt (i.e., there are no retries):

      java.lang.RuntimeException: PendingErrors overflowed. Failed to write at least 1000 rows to Kudu; Sample errors: Timed out: cannot complete before timeout: Batch{operations=1000, tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B, 0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1), Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
       Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot complete before timeout: Batch{operations=1000, tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B, 0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1), Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
       Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot complete before timeout: Batch{operations=1000, tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B, 0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1), Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
       Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot complete before timeout: Batch{operations=1000, tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B, 0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1), Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
       Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot complete before timeout: Batch{operations=1000, tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B, 0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1), Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
       Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}
      

      The root cause of the problem manifests itself as an NPE in Connection.exceptionCaught() with a stack trace like the one below:

      24/04/27 13:07:18 WARN DefaultPromise: An exception was thrown by org.apache.kudu.client.Connection$1.operationComplete()
       java.lang.NullPointerException
        at org.apache.kudu.client.Connection.exceptionCaught(Connection.java:434)
        at org.apache.kudu.client.Connection$1.operationComplete(Connection.java:746)
        at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
        at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571)
        at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550)
        at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
        at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
        at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
        at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
        at org.apache.kudu.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321)
        at org.apache.kudu.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337)
        at org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:710)
        at org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
        at org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
        at org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
        at org.apache.kudu.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:995)
        at org.apache.kudu.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
      

      The issue was introduced with KUDU-1438 in changelist 57dda5d48.
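
The failure mode described above can be illustrated with a toy model. This is not Kudu's code and not the actual fix: ToyConnection, FakeContext, the state names, and both handler variants are hypothetical, and the sketch only shows how an error handler that dereferences a null context before cleaning up can leave a connection object stuck in a non-terminal state, while a null-guarded handler still completes the cleanup.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for Netty's ChannelHandlerContext; not the real type. */
interface FakeContext {
    void close();
}

/** Toy connection model; names and states are illustrative, not Kudu's actual code. */
class ToyConnection {
    enum State { CONNECTING, TERMINATED }

    State state = State.CONNECTING;
    final List<String> queuedRpcs = new ArrayList<>();
    final List<String> failedRpcs = new ArrayList<>();

    /** Unguarded variant: dereferences ctx before any cleanup happens. */
    void exceptionCaughtUnguarded(FakeContext ctx, Throwable cause) {
        ctx.close();   // throws NullPointerException when ctx is null
        cleanup();     // never reached in that case
    }

    /** Guarded variant: cleanup runs whether or not a context exists. */
    void exceptionCaughtGuarded(FakeContext ctx, Throwable cause) {
        if (ctx != null) {
            ctx.close();
        }
        cleanup();
    }

    /** Marks the connection as terminated and hands queued RPCs back to the caller. */
    private void cleanup() {
        state = State.TERMINATED;
        failedRpcs.addAll(queuedRpcs);
        queuedRpcs.clear();
    }
}
```

In this model, calling exceptionCaughtUnguarded(null, cause) throws before cleanup() runs, so the connection stays in CONNECTING and its queued RPCs are never failed back to the caller. In the stack trace above, the NPE escapes from a listener invoked by Netty's DefaultPromise, which only logs it as a warning, so nothing else gets a chance to finish the cleanup.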

            People

              Assignee: Unassigned
              Reporter: Alexey Serbin (aserbin)