[FLINK-10941] Slots prematurely released which still contain unconsumed data - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.5.5, 1.6.2, 1.7.0
Fix Version/s: 1.7.3, 1.8.1, 1.9.0
Component/s: Runtime / Coordination
Labels:
- pull-request-available

Description

Our setup is Flink 1.5 in batch mode, reading the data source with parallelism 32 and writing the data sink with parallelism 4.

The read task worked perfectly with 32 TMs. However, once the job moved on to the write task, only 4 TMs were needed, so the other 28 TMs were released. This caused a RemoteTransportException in the write task:

org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'the_previous_TM_used_by_read_task'. This might indicate that the remote task manager was lost.
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelInactive(PartitionRequestClientHandler.java:133)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
...

After skimming the YarnFlinkResourceManager-related code, it seems to me that Flink releases TMs as soon as they are idle, regardless of whether the TMs that are still working need data from them.

Put another way, Flink seems to prematurely release slots that still contain unconsumed data and thus eventually releases a TM, which then fails a consuming task.
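
To make the suspected behaviour concrete, here is a minimal, self-contained sketch of the two release policies. It is not Flink's actual code: the class `IdleReleaseSketch`, the type `TaskManagerState`, its fields, and all method names are hypothetical, and "idle" is assumed to mean "no allocated slots and the idle timeout has elapsed". The first check decides on idleness alone, which matches the symptom described above; the second additionally requires that no produced data is still waiting to be consumed.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative model only; none of these names exist in Flink. */
public class IdleReleaseSketch {

    /** Hypothetical, simplified view of a TM as seen by the resource manager. */
    public static final class TaskManagerState {
        public final String id;
        public final int allocatedSlots;       // slots currently running tasks
        public final int unconsumedPartitions; // produced results not yet read downstream
        public final long idleMillis;          // time since the last slot was freed

        public TaskManagerState(String id, int allocatedSlots,
                                int unconsumedPartitions, long idleMillis) {
            this.id = id;
            this.allocatedSlots = allocatedSlots;
            this.unconsumedPartitions = unconsumedPartitions;
            this.idleMillis = idleMillis;
        }
    }

    /** Suspected behaviour: idleness alone decides, unconsumed data is ignored. */
    public static boolean releasableByIdleTimeOnly(TaskManagerState tm, long idleTimeoutMillis) {
        return tm.allocatedSlots == 0 && tm.idleMillis >= idleTimeoutMillis;
    }

    /** Safer variant: also require that every produced partition has been consumed. */
    public static boolean releasableWhenDrained(TaskManagerState tm, long idleTimeoutMillis) {
        return releasableByIdleTimeOnly(tm, idleTimeoutMillis) && tm.unconsumedPartitions == 0;
    }

    /** Returns the ids of the TMs that the chosen policy would release. */
    public static List<String> selectForRelease(List<TaskManagerState> tms,
                                                long idleTimeoutMillis,
                                                boolean requireDrained) {
        List<String> ids = new ArrayList<>();
        for (TaskManagerState tm : tms) {
            boolean release = requireDrained
                    ? releasableWhenDrained(tm, idleTimeoutMillis)
                    : releasableByIdleTimeOnly(tm, idleTimeoutMillis);
            if (release) {
                ids.add(tm.id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<TaskManagerState> tms = new ArrayList<>();
        tms.add(new TaskManagerState("tm-read-1", 0, 2, 60_000L)); // idle, data still pending
        tms.add(new TaskManagerState("tm-read-2", 0, 0, 60_000L)); // idle and fully drained
        tms.add(new TaskManagerState("tm-write-1", 1, 0, 0L));     // still running a task

        System.out.println(selectForRelease(tms, 30_000L, false)); // prints [tm-read-1, tm-read-2]
        System.out.println(selectForRelease(tms, 30_000L, true));  // prints [tm-read-2]
    }
}
```

Under the idle-only policy, `tm-read-1` is released even though the write task still has to fetch its data, which is the situation that would surface as the RemoteTransportException above.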

Issue Links

is duplicated by

FLINK-12106 Jobmanager is killing FINISHED taskmanger containers, causing exception in still running Taskmanagers an

Closed

relates to

FLINK-12736 ResourceManager may release TM with allocated slots

Resolved

FLINK-12193 Send TM "can be released status" with RM heartbeat

Open

FLINK-12069 Add proper lifecycle management for intermediate result partitions

Closed

links to

GitHub Pull Request #7186

GitHub Pull Request #7938

GitHub Pull Request #8201

People

Assignee: Andrey Zagrebin
Reporter: Qi
Votes: 1
Watchers: 11

Dates

Created: 20/Nov/18 05:04
Updated: 04/Jul/19 08:55
Resolved: 19/Apr/19 08:52

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 40m