Flink / FLINK-10941

Slots prematurely released which still contain unconsumed data

      Description

      Our setup: Flink 1.5 in batch mode, with a parallelism of 32 for reading from the data source and a parallelism of 4 for writing to the data sink.
       
      The read task worked perfectly with 32 TMs. However, when the job moved on to the write task, only 4 TMs were needed, so the other 28 TMs were released. This caused a RemoteTransportException in the write task:
       
      org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'the_previous_TM_used_by_read_task'. This might indicate that the remote task manager was lost.
      at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelInactive(PartitionRequestClientHandler.java:133)
      at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
      ...
       
      After skimming the YarnFlinkResourceManager-related code, it seems to me that Flink releases TMs as soon as they are idle, regardless of whether the TMs that are still working need them.
       
      Put another way, Flink seems to prematurely release slots that still contain unconsumed data and thus eventually releases a TM, which then fails a consuming task.
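
      To make the suspected behaviour concrete, here is a minimal sketch in plain Java. It is not Flink code: IdleReleaseSketch, TaskManagerState, selectForRelease and the requireConsumed flag are hypothetical names invented for illustration, and the timeout values are arbitrary. It only contrasts an idle-only release check with one that also looks for unconsumed result partitions, under the assumption that every TM still holds output from the read task.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration only; none of these types are Flink classes.
final class IdleReleaseSketch {

    /** Simplified view of one TaskManager (TM) as a resource manager might track it. */
    static final class TaskManagerState {
        final String id;
        final int allocatedSlots;        // slots currently running tasks
        final int unconsumedPartitions;  // produced results not yet fully read downstream
        final long idleSinceMillis;      // time at which the TM last became idle

        TaskManagerState(String id, int allocatedSlots, int unconsumedPartitions, long idleSinceMillis) {
            this.id = id;
            this.allocatedSlots = allocatedSlots;
            this.unconsumedPartitions = unconsumedPartitions;
            this.idleSinceMillis = idleSinceMillis;
        }
    }

    /** Idle-only check: no running tasks and the idle timeout has elapsed. */
    static boolean idleTimedOut(TaskManagerState tm, long nowMillis, long timeoutMillis) {
        return tm.allocatedSlots == 0 && nowMillis - tm.idleSinceMillis >= timeoutMillis;
    }

    /**
     * Returns the ids of TMs that would be released.
     * With requireConsumed = false only idleness is checked (the behaviour suspected in this report).
     * With requireConsumed = true a TM that still holds unconsumed data is kept.
     */
    static List<String> selectForRelease(List<TaskManagerState> tms, long nowMillis,
                                         long timeoutMillis, boolean requireConsumed) {
        List<String> release = new ArrayList<>();
        for (TaskManagerState tm : tms) {
            boolean holdsData = tm.unconsumedPartitions > 0;
            if (idleTimedOut(tm, nowMillis, timeoutMillis) && !(requireConsumed && holdsData)) {
                release.add(tm.id);
            }
        }
        return release;
    }

    /** Scenario from the report: 32 TMs hold read output, only 4 of them run write tasks. */
    static List<TaskManagerState> reportScenario() {
        List<TaskManagerState> tms = new ArrayList<>();
        for (int i = 0; i < 32; i++) {
            int allocated = i < 4 ? 1 : 0; // assumption: the first 4 TMs host the sink tasks
            tms.add(new TaskManagerState("tm-" + i, allocated, 1, 0L));
        }
        return tms;
    }

    public static void main(String[] args) {
        List<TaskManagerState> tms = reportScenario();
        System.out.println("idle-only check releases: "
                + selectForRelease(tms, 60_000L, 30_000L, false).size());
        System.out.println("with data check releases: "
                + selectForRelease(tms, 60_000L, 30_000L, true).size());
    }
}
```

      In this model the idle-only check selects the 28 TMs without running tasks, which matches what we observed, while the variant that also checks for unconsumed data keeps all of them until the write task has read their output.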

              People

              • Assignee:
                azagrebin Andrey Zagrebin
                Reporter:
                QiLuo Qi
              • Votes:
                1
                Watchers:
                10

                  Time Tracking

                  Estimated:
                  Not Specified
                  Remaining:
                  0h
                  Logged:
                  40m