Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-5840

A query that includes sort completes, and then loses Drill connection. Drill becomes unresponsive, and cannot restart because it cannot communicate with Zookeeper

    XMLWordPrintableJSON

Details

    Description

      Query is:

      ALTER SESSION SET `exec.sort.disable_managed` = false;
      select count(*) from (select * from dfs.`/drill/testdata/resource-manager/250wide.tbl` order by columns[0])d where d.columns[0] = 'ljdfhwuehnoiueyf';
      

      Query tries to complete, but cannot. It takes 20 hours from the time the query tries to complete, to the time Drill finally loses its connection.

      From the drillbit.log:

      2017-10-03 16:28:14,892 [262bec7f-3539-0dd7-6fea-f2959f9df3b6:frag:0:0] DEBUG o.a.drill.exec.work.foreman.Foreman - 262bec7f-3539-0dd7-6fea-f2959f9df3b6: State change requested RUNNING --> COMPLETED
      2017-10-04 01:47:27,698 [UserServer-1] DEBUG o.a.d.e.r.u.UserServerRequestHandler - Received query to run.  Returning query handle.
      2017-10-04 03:30:02,916 [262bec7f-3539-0dd7-6fea-f2959f9df3b6:frag:0:0] WARN  o.a.d.exec.work.foreman.QueryManager - Failure while trying to delete the estore profile for this query.
      org.apache.drill.common.exceptions.DrillRuntimeException: unable to delete node at /running/262bec7f-3539-0dd7-6fea-f2959f9df3b6
      	at org.apache.drill.exec.coord.zk.ZookeeperClient.delete(ZookeeperClient.java:343) ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.coord.zk.ZkEphemeralStore.remove(ZkEphemeralStore.java:108) ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.work.foreman.QueryManager.updateEphemeralState(QueryManager.java:293) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.work.foreman.Foreman.recordNewState(Foreman.java:1043) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.work.foreman.Foreman.moveToState(Foreman.java:964) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.work.foreman.Foreman.access$2600(Foreman.java:113) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:1025) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:1018) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.common.EventProcessor.processEvents(EventProcessor.java:107) [drill-common-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.common.EventProcessor.sendEvent(EventProcessor.java:65) [drill-common-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.addEvent(Foreman.java:1020) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.work.foreman.Foreman.addToEventQueue(Foreman.java:1038) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.work.foreman.QueryManager.nodeComplete(QueryManager.java:498) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.work.foreman.QueryManager.access$100(QueryManager.java:66) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.work.foreman.QueryManager$NodeTracker.fragmentComplete(QueryManager.java:462) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.work.foreman.QueryManager.fragmentDone(QueryManager.java:147) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.work.foreman.QueryManager.access$400(QueryManager.java:66) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.work.foreman.QueryManager$1.statusUpdate(QueryManager.java:525) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.rpc.control.WorkEventBus.statusUpdate(WorkEventBus.java:71) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.work.fragment.FragmentStatusReporter.sendStatus(FragmentStatusReporter.java:124) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.work.fragment.FragmentStatusReporter.stateChanged(FragmentStatusReporter.java:94) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:304) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:160) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:267) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) [drill-common-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_111]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_111]
      	at java.lang.Thread.run(Thread.java:745) [na:1.7.0_111]
      Caused by: org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
      	at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) ~[curator-client-2.7.1.jar:na]
      	at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87) ~[curator-client-2.7.1.jar:na]
      	at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115) ~[curator-client-2.7.1.jar:na]
      	at org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:477) ~[curator-framework-2.7.1.jar:na]
      	at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:238) ~[curator-framework-2.7.1.jar:na]
      	at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:233) ~[curator-framework-2.7.1.jar:na]
      	at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) ~[curator-client-2.7.1.jar:na]
      	at org.apache.curator.framework.imps.DeleteBuilderImpl.pathInForeground(DeleteBuilderImpl.java:230) ~[curator-framework-2.7.1.jar:na]
      	at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:214) ~[curator-framework-2.7.1.jar:na]
      	at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:41) ~[curator-framework-2.7.1.jar:na]
      	at org.apache.drill.exec.coord.zk.ZookeeperClient.delete(ZookeeperClient.java:340) ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	... 27 common frames omitted
      2017-10-04 03:30:02,916 [262bec7f-3539-0dd7-6fea-f2959f9df3b6:frag:0:0] DEBUG o.a.drill.exec.work.foreman.Foreman - 262bec7f-3539-0dd7-6fea-f2959f9df3b6: cleaning up.
      2017-10-04 03:30:02,916 [262bec7f-3539-0dd7-6fea-f2959f9df3b6:frag:0:0] DEBUG o.a.d.exec.rpc.control.WorkEventBus - Removing fragment status listener for queryId 262bec7f-3539-0dd7-6fea-f2959f9df3b6.
      2017-10-04 03:30:02,970 [262bec7f-3539-0dd7-6fea-f2959f9df3b6:frag:0:0] DEBUG o.a.drill.exec.work.foreman.Foreman - 262bec7f-3539-0dd7-6fea-f2959f9df3b6: State change requested COMPLETED --> CANCELLATION_REQUESTED
      2017-10-04 03:30:02,970 [262bec7f-3539-0dd7-6fea-f2959f9df3b6:frag:0:0] WARN  o.a.drill.exec.work.foreman.Foreman - Dropping request to move to CANCELLATION_REQUESTED state as query is already at COMPLETED state (which is terminal).
      2017-10-04 03:30:02,970 [262bec7f-3539-0dd7-6fea-f2959f9df3b6:frag:0:0] DEBUG o.a.d.e.w.f.FragmentStatusReporter - Closing org.apache.drill.exec.work.fragment.FragmentStatusReporter@75e9a886
      2017-10-04 03:30:02,970 [drill-executor-42] DEBUG o.a.d.exec.rpc.control.WorkEventBus - Removing fragment manager: 262bec7f-3539-0dd7-6fea-f2959f9df3b6:0:0
      2017-10-04 12:49:47,660 [UserServer-1] WARN  o.a.drill.exec.rpc.user.UserServer - Message of mode REQUEST of rpc type 3 took longer than 500ms.  Actual duration was 39739962ms.
      2017-10-04 12:49:47,662 [UserServer-1] ERROR o.a.d.exec.rpc.RpcExceptionHandler - Exception in RPC communication.  Connection: /10.10.100.190:31010 <--> /10.10.100.190:37568 (user server).  Closing connection.
      io.netty.handler.codec.DecoderException: org.apache.drill.common.exceptions.DrillRuntimeException: unable to put 
      	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:99) [netty-codec-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.handler.timeout.ReadTimeoutHandler.channelRead(ReadTimeoutHandler.java:150) [netty-handler-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) [netty-codec-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:242) [netty-codec-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:847) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) [netty-common-4.0.27.Final.jar:4.0.27.Final]
      	at java.lang.Thread.run(Thread.java:745) [na:1.7.0_111]
      Caused by: org.apache.drill.common.exceptions.DrillRuntimeException: unable to put 
      	at org.apache.drill.exec.coord.zk.ZookeeperClient.put(ZookeeperClient.java:300) ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.coord.zk.ZookeeperClient.put(ZookeeperClient.java:246) ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.coord.zk.ZkEphemeralStore.put(ZkEphemeralStore.java:79) ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.work.foreman.QueryManager.updateEphemeralState(QueryManager.java:285) ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.work.foreman.Foreman.recordNewState(Foreman.java:1043) ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.work.foreman.Foreman.<init>(Foreman.java:180) ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.work.user.UserWorker.submitWork(UserWorker.java:78) ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.rpc.user.UserServerRequestHandler.handle(UserServerRequestHandler.java:66) ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.rpc.user.UserServerRequestHandler.handle(UserServerRequestHandler.java:47) ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.rpc.BasicServer.handle(BasicServer.java:157) ~[drill-rpc-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.rpc.BasicServer.handle(BasicServer.java:53) ~[drill-rpc-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:274) ~[drill-rpc-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:244) ~[drill-rpc-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89) [netty-codec-4.0.27.Final.jar:4.0.27.Final]
      	... 22 common frames omitted
      Caused by: org.apache.drill.common.exceptions.DrillRuntimeException: error while checking path on zookeeper
      	at org.apache.drill.exec.coord.zk.ZookeeperClient.hasPath(ZookeeperClient.java:139) ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.coord.zk.ZookeeperClient.hasPath(ZookeeperClient.java:107) ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.coord.zk.ZookeeperClient.put(ZookeeperClient.java:273) ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	... 35 common frames omitted
      Caused by: org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
      	at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) ~[curator-client-2.7.1.jar:na]
      	at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87) ~[curator-client-2.7.1.jar:na]
      	at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115) ~[curator-client-2.7.1.jar:na]
      	at org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:477) ~[curator-framework-2.7.1.jar:na]
      	at org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:172) ~[curator-framework-2.7.1.jar:na]
      	at org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:161) ~[curator-framework-2.7.1.jar:na]
      	at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) ~[curator-client-2.7.1.jar:na]
      	at org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:158) ~[curator-framework-2.7.1.jar:na]
      	at org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:148) ~[curator-framework-2.7.1.jar:na]
      	at org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:36) ~[curator-framework-2.7.1.jar:na]
      	at org.apache.drill.exec.coord.zk.ZookeeperClient.hasPath(ZookeeperClient.java:130) ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	... 37 common frames omitted
      2017-10-04 12:49:47,667 [UserServer-1] INFO  o.a.drill.exec.rpc.user.UserServer - RPC connection /10.10.100.190:31010 <--> /10.10.100.190:37579 (user server) timed out.  Timeout was set to 30 seconds. Closing connection.
      2017-10-04 12:49:47,668 [UserServer-1] INFO  o.a.drill.exec.work.foreman.Foreman - Failure while trying communicate query result to initiating client. This would happen if a client is disconnected before response notice can be sent.
      org.apache.drill.exec.rpc.ChannelClosedException: Channel closed /10.10.100.190:31010 <--> /10.10.100.190:37579.
      	at org.apache.drill.exec.rpc.RpcBus$ChannelClosedHandler.operationComplete(RpcBus.java:164) [drill-rpc-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at org.apache.drill.exec.rpc.RpcBus$ChannelClosedHandler.operationComplete(RpcBus.java:144) [drill-rpc-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680) [netty-common-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603) [netty-common-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563) [netty-common-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:406) [netty-common-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.channel.AbstractChannel$CloseFuture.setClosed(AbstractChannel.java:943) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.channel.AbstractChannel$AbstractUnsafe.doClose0(AbstractChannel.java:592) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:584) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.channel.DefaultChannelPipeline$HeadContext.close(DefaultChannelPipeline.java:1099) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.channel.AbstractChannelHandlerContext.invokeClose(AbstractChannelHandlerContext.java:615) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:600) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.channel.ChannelOutboundHandlerAdapter.close(ChannelOutboundHandlerAdapter.java:71) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.channel.AbstractChannelHandlerContext.invokeClose(AbstractChannelHandlerContext.java:615) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:600) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:466) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.handler.timeout.ReadTimeoutHandler.readTimedOut(ReadTimeoutHandler.java:187) [netty-handler-4.0.27.Final.jar:4.0.27.Final]
      	at org.apache.drill.exec.rpc.BasicServer$LoggingReadTimeoutHandler.readTimedOut(BasicServer.java:122) [drill-rpc-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      	at io.netty.handler.timeout.ReadTimeoutHandler$ReadTimeoutTask.run(ReadTimeoutHandler.java:212) [netty-handler-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38) [netty-common-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120) [netty-common-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) [netty-common-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
      	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) [netty-common-4.0.27.Final.jar:4.0.27.Final]
      	at java.lang.Thread.run(Thread.java:745) [na:1.7.0_111]
      

      When I re-connect to Drill and try to show schemas, Drill hangs, and I see this in drillbit.log

      2017-10-04 13:01:01,414 [UserServer-1] DEBUG o.a.d.e.r.u.UserServerRequestHandler - Received query to run.  Returning query handle.
      

      I try to stop Drill, but it doesn't stop, so it gets killed.

      [root@qa-node190 logs.67]# /opt/drill/bin/drillbit.sh stop
      Stopping drillbit
      .........................................................................................................................
      drillbit did not complete after 120 seconds, killing with kill -9 14550
      

      I try to start Drill

      /opt/drill/bin/drillbit.sh start
      

      But it fails because it cannot communicate with Zookeeper:

      2017-10-04 14:40:32,812 [main] DEBUG o.a.d.e.c.zk.ZKClusterCoordinator - Starting ZKClusterCoordination.
      2017-10-04 14:40:42,820 [main] ERROR o.apache.drill.exec.server.Drillbit - Failure during initial startup of Drillbit.
      java.io.IOException: Failure to connect to the zookeeper cluster service within the allotted time of 10000 milliseconds.
              at org.apache.drill.exec.coord.zk.ZKClusterCoordinator.start(ZKClusterCoordinator.java:123) ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
              at org.apache.drill.exec.server.Drillbit.run(Drillbit.java:153) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
              at org.apache.drill.exec.server.Drillbit.start(Drillbit.java:343) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
              at org.apache.drill.exec.server.Drillbit.start(Drillbit.java:316) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
              at org.apache.drill.exec.server.Drillbit.main(Drillbit.java:312) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      2017-10-04 14:40:42,820 [main] DEBUG o.apache.drill.exec.server.Drillbit - Shutdown begun.
      2017-10-04 14:40:44,714 [main] WARN  o.apache.drill.exec.server.Drillbit - Failure on close()
      java.lang.NullPointerException: null
              at org.apache.drill.exec.work.WorkManager.close(WorkManager.java:152) ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
              at org.apache.drill.common.AutoCloseables.close(AutoCloseables.java:76) ~[drill-common-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
              at org.apache.drill.common.AutoCloseables.close(AutoCloseables.java:64) ~[drill-common-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
              at org.apache.drill.exec.server.Drillbit.close(Drillbit.java:199) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
              at org.apache.drill.exec.server.Drillbit.start(Drillbit.java:346) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
              at org.apache.drill.exec.server.Drillbit.start(Drillbit.java:316) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
              at org.apache.drill.exec.server.Drillbit.main(Drillbit.java:312) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
      2017-10-04 14:40:44,715 [main] INFO  o.apache.drill.exec.server.Drillbit - Shutdown completed (1894 ms).
      

      I normally restart warden and zookeeper to resolve this. When I stopped warden, it took 20 minutes (normally it should take a few minutes or less):

      [root@qa-node190 logs.67]# time service mapr-warden stop
      stopping WARDEN
      looking to stop mapr-core processes not started by warden
      
      real	20m46.637s
      user	0m1.622s
      sys	0m2.432s
      

      Stopping zookeeper was quick, and restarting all processes including Drill completed as expected.

      Attachments

        Activity

          People

            paul-rogers Paul Rogers
            rhou Robert Hou
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: