KUDU-2589

ITClient is flaky under stress when leadership changes

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 1.8.0
    • Fix Version/s: NA
    • Component/s: test
    • Labels: None

    Description

      Saw this failure in the ITClient test:

      {noformat}
      00:34:51.362 DEBUG - New I/O worker #10 (AsyncKuduScanner.java:492) Can not open scanner
      org.apache.kudu.client.NonRecoverableException: Tablet hasn't heard from leader, or there hasn't been a stable leader for: 0.757s secs, (max is 0.750s):
      at org.apache.kudu.client.RpcProxy.dispatchTSError(RpcProxy.java:320)
      at org.apache.kudu.client.RpcProxy.responseReceived(RpcProxy.java:242)
      at org.apache.kudu.client.RpcProxy.access$000(RpcProxy.java:59)
      at org.apache.kudu.client.RpcProxy$1.call(RpcProxy.java:131)
      at org.apache.kudu.client.RpcProxy$1.call(RpcProxy.java:127)
      at org.apache.kudu.client.Connection.messageReceived(Connection.java:391)
      at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
      at org.apache.kudu.client.Connection.handleUpstream(Connection.java:243)
      at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
      at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
      at org.jboss.netty.handler.timeout.ReadTimeoutHandler.messageReceived(ReadTimeoutHandler.java:184)
      at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
      at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
      at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
      at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
      at org.jboss.netty.handler.codec.oneone.OneToOneDecoder.handleUpstream(OneToOneDecoder.java:70)
      at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
      at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
      at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
      at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
      at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
      at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
      at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
      at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
      at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
      at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
      at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
      at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
      at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
      at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
      at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
      at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
      at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
      at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)
      {noformat}

      There was a new leader elected just before:

      {noformat}
      00:34:50.953 [INFO - cluster stderr printer] (MiniKuduCluster.java:526) I0925 00:34:50.953722 18257 catalog_manager.cc:3758] T 07e99535cba24d8e991829485de22275 P 060cf49269fb4f6f901696d741e69303 reported cstate change: term changed from 1 to 2, leader changed from acaedc4ec505489dbc853d8b32bfc147 (127.16.196.2) to 060cf49269fb4f6f901696d741e69303 (127.16.196.3). New cstate: current_term: 2 leader_uuid: "060cf49269fb4f6f901696d741e69303" committed_config { opid_index: -1 OBSOLETE_local: false peers { permanent_uuid: "acaedc4ec505489dbc853d8b32bfc147" member_type: VOTER last_known_addr { host: "127.16.196.2" port: 58760 } health_report { overall_health: UNKNOWN } } peers { permanent_uuid: "060cf49269fb4f6f901696d741e69303" member_type: VOTER last_known_addr { host: "127.16.196.3" port: 37181 } health_report { overall_health: HEALTHY } } peers { permanent_uuid: "2f0ee942338c4fb7bce908b1c32f0bcb" member_type: VOTER last_known_addr { host: "127.16.196.1" port: 47865 } health_report { overall_health: UNKNOWN } } }
      {noformat}

      It seems the ~0.75-second period the async scanner will wait for a stable leader partially overlapped with the election, and the remaining portion of the period wasn't enough to get an answer on who the new leader was: the failure at 00:34:51.362 came only about 0.4s after the new leader was reported at 00:34:50.953, so roughly half of the 0.757s window had already elapsed before the election completed.

          People

            Assignee: Unassigned
            Reporter: wdberkeley (William Berkeley)