Flink / FLINK-15074

Connection timed out, Standalone cluster

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Abandoned
    • Affects Version/s: 1.9.1
    • Fix Version/s: None
    • Component/s: Runtime / Network
    • Labels: None
    • Environment:
      Flink version: 1.5.1, 1.9.1
      JDK version: 1.8.0_181
      Number of servers: 15
      Number of TaskManagers: 178
      Number of slots: 178

    Description

      I am running a Flink streaming application on a standalone cluster.

      It works well when the job's parallelism is low, for example 96.

      But when I increase the job's parallelism to a high value, such as 164 or more, the job fails within 10-15 minutes with a connection timeout error.

      I have tried to solve this by increasing TaskManager settings such as 'taskmanager.network.netty.server.numThreads', 'taskmanager.network.netty.client.numThreads', 'taskmanager.network.request-backoff.max', and 'akka.ask.timeout', but it does not help.

      I also tried different Flink versions, 1.5.1 and 1.9.1, and that does not help either.

      Does anyone know how to fix this problem? I am out of ideas. It looks like a bug.

      I have uploaded my config and logs as attachments; the error trace is below:
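
When several settings have been changed without effect, it can help to confirm what the file deployed on each TaskManager host actually contains. The following is a minimal sketch, assuming the flat "key: value" layout of flink-conf.yaml. It is a simplified reader for illustration, not Flink's own configuration loader, and the helper names `parse_flink_conf` and `report` are hypothetical.

```python
# Keys quoted in this report; the values come from whatever file is parsed.
KEYS_OF_INTEREST = (
    "taskmanager.network.netty.server.numThreads",
    "taskmanager.network.netty.client.numThreads",
    "taskmanager.network.request-backoff.max",
    "akka.ask.timeout",
)


def parse_flink_conf(text):
    """Parse flat 'key: value' lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        # Drop anything after a '#', then surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        settings[key.strip()] = value.strip()
    return settings


def report(settings, keys=KEYS_OF_INTEREST):
    """Return (key, value) pairs; value is None when the key is not set in the file."""
    return [(key, settings.get(key)) for key in keys]
```

Running `report(parse_flink_conf(open("flink-conf.yaml").read()))` on each host and comparing the results would show whether a key is missing or differs between machines. A `None` value only means the key is absent from the file, in which case Flink falls back to its default.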

       

      ------------------------------------------------------------------

      org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: Connection timed out
      at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:172) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
      at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
      at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
      at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
      at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
      at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
      at org.apache.flink.shaded.netty4.io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:79) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
      at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
      at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
      at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:835) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
      at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:87) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
      at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:162) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
      at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
      at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
      at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
      at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
      at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
      at java.lang.Thread.run(Thread.java:748) [na:1.8.0_181]
      Caused by: java.io.IOException: Connection timed out
      at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[na:1.8.0_181]
      at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[na:1.8.0_181]
      at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[na:1.8.0_181]
      at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[na:1.8.0_181]
      at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) ~[na:1.8.0_181]
      at org.apache.flink.shaded.netty4.io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
      at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
      at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
      at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
      ... 6 common frames omitted
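
To make a long trace like the one above easier to compare across TaskManager logs, the exception chain can be reduced to one line per exception. This is a minimal sketch, assuming the standard Java stack trace layout. The function name `exception_chain` is hypothetical and the regular expression only recognises class names ending in `Exception` or `Error`.

```python
import re

# Matches "pkg.ClassName: message", optionally prefixed by "Caused by: ".
_EXCEPTION_LINE = re.compile(r"^(?:Caused by: )?([\w.$]+(?:Exception|Error)): (.*)$")


def exception_chain(trace):
    """Return (class_name, message, first_frame) tuples, outermost exception first."""
    chain = []
    for raw in trace.splitlines():
        line = raw.strip()
        match = _EXCEPTION_LINE.match(line)
        if match:
            chain.append([match.group(1), match.group(2), None])
        elif line.startswith("at ") and chain and chain[-1][2] is None:
            # Keep only the frame closest to where the exception was raised,
            # without the "(File.java:NN)" suffix.
            chain[-1][2] = line[3:].split("(", 1)[0]
    return [tuple(entry) for entry in chain]
```

Applied to the trace above, the last entry is the `java.io.IOException` whose first frame is `sun.nio.ch.FileDispatcherImpl.read0`. That suggests the timeout was reported while reading from an already established socket rather than while opening a new connection, although the trace alone does not show why the peer stopped responding.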

      Attachments

        1. flink-conf.yaml
          10 kB
          gameking
        2. jobmanager.log
          8 kB
          gameking
        3. taskmanager.log
          65 kB
          gameking

          People

            Assignee: Unassigned
            Reporter: gameking
            Votes: 0
            Watchers: 2
