  1. Tajo
  2. TAJO-942

NettyClientBase throws RejectedExecutionException occasionally.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: RPC
    • Labels:
      None

      Description

      NettyClientBase throws RejectedExecutionException occasionally.

      For example, add the following simple test to a unit test class.

        @Test
        public final void testShutdownCluster() throws Exception {
          TajoTestingCluster activeMaster = new TajoTestingCluster();
          activeMaster.startMiniCluster(1);
          activeMaster.shutdownMiniCluster();
        }
      

      After adding the code above, run 'mvn clean install'. The build then falls into an infinite loop that logs the following:

      2014-07-15 10:36:12,217 ERROR: org.apache.tajo.rpc.AsyncRpcClient (exceptionCaught(235)) - RPC Exception:java.util.concurrent.RejectedExecutionException: Worker has already been shutdown
      2014-07-15 10:36:12,218 ERROR: org.apache.tajo.worker.WorkerHeartbeatService (run(241)) - java.util.concurrent.RejectedExecutionException: Worker has already been shutdown
      java.lang.RuntimeException: java.util.concurrent.RejectedExecutionException: Worker has already been shutdown
      	at org.apache.tajo.rpc.NettyClientBase.connect(NettyClientBase.java:93)
      	at org.apache.tajo.rpc.RpcConnectionPool.getConnection(RpcConnectionPool.java:89)
      	at org.apache.tajo.worker.WorkerHeartbeatService$WorkerHeartbeatThread.run(WorkerHeartbeatService.java:220)
      Caused by: java.util.concurrent.RejectedExecutionException: Worker has already been shutdown
      	at org.jboss.netty.channel.socket.nio.AbstractNioSelector.registerTask(AbstractNioSelector.java:115)
      	at org.jboss.netty.channel.socket.nio.AbstractNioSelector.register(AbstractNioSelector.java:100)
      	at org.jboss.netty.channel.socket.nio.NioClientBoss.register(NioClientBoss.java:42)
      	at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:121)
      	at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:70)
      	at org.jboss.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:54)
      	at org.jboss.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:54)
      	at org.jboss.netty.channel.Channels.connect(Channels.java:634)
      	at org.jboss.netty.channel.AbstractChannel.connect(AbstractChannel.java:207)
      	at org.jboss.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:229)
      	at org.jboss.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:182)
      	at org.apache.tajo.rpc.NettyClientBase.connect(NettyClientBase.java:76)
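
The repeated log entries above are consistent with a caller that keeps retrying connect() against a pool that has already been shut down. The following is a minimal sketch of how such a caller could stop retrying. It uses a plain JDK ExecutorService as a stand-in for the Netty worker pool, which is an assumption: this is not Tajo or Netty code, and the class and method names are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical illustration only; not part of Tajo or Netty.
class ShutdownRetrySketch {

  /**
   * Tries to hand a task to the pool, at most maxAttempts times.
   * Returns true if the task was accepted. Returns false as soon as the
   * pool turns out to be shut down, because retrying cannot succeed then.
   */
  static boolean submitOrGiveUp(ExecutorService pool, Runnable task, int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        pool.execute(task);
        return true;
      } catch (RejectedExecutionException e) {
        if (pool.isShutdown()) {
          // Permanent condition: the pool will never accept work again.
          return false;
        }
        // Otherwise the rejection may be transient (for example a full queue),
        // so fall through and try again.
      }
    }
    return false; // attempts exhausted
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    pool.shutdown();
    // The pool is already shut down, so the helper gives up immediately.
    System.out.println(submitOrGiveUp(pool, () -> { }, 5)); // prints "false"
  }
}
```

An unbounded retry around the same call would spin forever once the pool is shut down, which matches the symptom reported here.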
      

        Activity

        blrunner Jaehwa Jung added a comment - edited

        Hi, guys,

        I tried a few tests to resolve this bug, as follows:

        • Case 1
          1. TajoMaster::stop
            //RpcChannelFactory.shutdown()
            //super.stop();
            
          2. TajoWorker::stop
            //connPool.shutdown();
            //RpcChannelFactory.shutdown();
            

            In this case, the unit test finished successfully.

        • Case 2
          1. TajoMaster::stop
            //super.stop();
            
          2. RpcChannelFactory::shutdown
            //factory.releaseExternalResources();
            
          3. RpcConnectionPool::shutdown
            //factory.releaseExternalResources();
            

            In this case, the unit test finished successfully.

        • Case 3
          1. RpcChannelFactory::shutdown
            //factory.releaseExternalResources();
            
          2. RpcConnectionPool::shutdown
            //factory.releaseExternalResources();
            

            In this case, NettyClientBase::connect goes into an infinite loop.

        • Case 4
          1. Every component that calls NettyClientBase and NettyServerBase sets a unique service name that includes the host address. Even in this case, NettyClientBase::connect goes into an infinite loop.
        • Case 5
          1. RpcChannelFactory::getSharedClientChannelFactory
              public static synchronized ClientSocketChannelFactory getSharedClientChannelFactory(){
                // shared worker and boss pool
                TajoConf conf = new TajoConf();
                int workerNum = conf.getIntVar(TajoConf.ConfVars.INTERNAL_RPC_CLIENT_WORKER_THREAD_NUM);
                return createClientChannelFactory("Internal-Client", workerNum);
              }
            
          2. WorkerHeartbeatThread::run causes a RejectedExecutionException.

        I think Tajo should share the RPC channel through a context member instead of a static member. However, changing this architecture could affect other code and performance. We therefore need to discuss this issue first, and if you agree to resolve it, we should handle it in a separate JIRA issue.
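
To illustrate the difference between a static member and a context member, here is a minimal sketch under stated assumptions. A JDK ExecutorService stands in for the shared RPC channel resources, and ServiceContext is a hypothetical name, not an existing Tajo class. Each context owns its pool, so shutting down one context does not invalidate the pool used by another.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration only; not part of Tajo.
class ContextOwnedPoolSketch {

  /** Stand-in for a master or worker context that owns its own RPC resources. */
  static final class ServiceContext implements AutoCloseable {
    // Instance field instead of a static field: one pool per context.
    private final ExecutorService rpcPool = Executors.newFixedThreadPool(2);

    ExecutorService rpcPool() {
      return rpcPool;
    }

    @Override
    public void close() {
      // Releases only this context's pool.
      rpcPool.shutdown();
    }
  }

  public static void main(String[] args) {
    ServiceContext master = new ServiceContext();
    ServiceContext worker = new ServiceContext();

    master.close(); // stopping the master does not touch the worker's pool

    // The worker's pool is still running, so this task is accepted.
    worker.rpcPool().execute(() -> System.out.println("worker task accepted"));
    worker.close(); // already-submitted tasks still run before the pool terminates
  }
}
```

With a single static pool, the first component to call shutdown would leave every other component holding a pool that rejects new work. The trade-off is that each context now creates its own threads, which is part of the performance concern raised above.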


          People

          • Assignee:
            blrunner Jaehwa Jung
            Reporter:
            blrunner Jaehwa Jung
          • Votes:
            0
            Watchers:
            1
