Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-28115

Flink 1.15.0 Parallelism Rebalance causes flink job failure

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 1.15.0
    • None
    • None
    • None
    • Flink 1.15.0 session cluster with 8 hosts.

    Description

      Issue:

      Flink 1.15.0 Parallelism Rebalance causes flink job failure. Same issue was not in flink 1.14.4.

      Exceptions:

      1 of the 8 re-balance parallelism task slots failed due to 
      org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /127.0.0.1:43354
      Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused

       

      Job topology:

      kafkaSource (parallelism 4) -> map/processer (parallelism 8) -> kafkaSink (4 parallelism)

      Our dev flink cluster has 8 hosts, each host has 25 task managers alive. Each TM has 2 task slots

       
       
      Error stack trace:
      2022-06-17 12:54:38.563 WARN  [Framework] Map (3/8)#5 org.apache.flink.runtime.taskmanager.Task  - Map (3/8)#5 (69a82f741d68fd7161d7b13de48c6c4b) switched from RUNNING to FAILED with failure cause: org.apache.flink.runtime.io.network.partition.consumer.PartitionConnectionException: Connection for partition 2064424258b3b74fdc349607017f1029#1@a94744160d7d5e85101881e2d783dcd2 not reachable.
          at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:190)
          at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.internalRequestPartitions(SingleInputGate.java:342)
          at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:312)
          at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.requestPartitions(InputGateWithMetrics.java:115)
          at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
          at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90)
          at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsNonBlocking(MailboxProcessor.java:353)
          at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:317)
          at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:201)
          at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:804)
          at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:753)
          at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:948)
          at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:927)
          at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:741)
          at org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
          at java.lang.Thread.run(Thread.java:748)
      Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connecting to remote task manager '/127.0.0.1:43354' has failed. This might indicate that the remote task manager has been lost.
          at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.connect(PartitionRequestClientFactory.java:169)
          at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.connectWithRetries(PartitionRequestClientFactory.java:135)
          at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:96)
          at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:95)
          at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:186)
          ... 15 more
      Caused by: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /127.0.0.1:43354
      Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
          at org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors.newConnectException0(Errors.java:155)
          at org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:128)
          at org.apache.flink.shaded.netty4.io.netty.channel.unix.Socket.finishConnect(Socket.java:320)
          at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710)
          at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:687)
          at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567)
          at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:470)
          at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
          at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
          at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
          at java.lang.Thread.run(Thread.java:748)

      Attachments

        Issue Links

          Activity

            People

              gaoyunhaii Yun Gao
              huameng Huameng Li
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated: