[FLINK-17470] Flink task executor process permanently hangs on `flink-daemon.sh stop`, deletes PID file - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.10.0
Fix Version/s: 1.12.0
Component/s: Runtime / Coordination
Labels:
- pull-request-available
Environment:

$ uname -a
Linux hostname.local 3.10.0-1062.9.1.el7.x86_64 #1 SMP Fri Dec 6 15:49:49 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
$ lsb_release -a
LSB Version: :core-4.1-amd64:core-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.7.1908 (Core)
Release: 7.7.1908
Codename: Core

Flink version 1.10
Release Note:
In Flink 1.12 we changed the behavior of the standalone scripts to issue a SIGKILL if a SIGTERM did not succeed in shutting down a Flink process.
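
The SIGTERM-then-SIGKILL behavior described in the release note can be sketched roughly as below. This is a minimal illustration only, not the code merged for this issue: the function name `stop_with_fallback`, the one-second polling interval, and the 10-second default timeout are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not taken from flink-daemon.sh): ask a process to stop
# with SIGTERM, wait up to a timeout, then fall back to SIGKILL.
stop_with_fallback() {
    local pid="$1"
    local timeout_secs="${2:-10}"   # assumed default, adjust as needed
    local waited=0

    # Nothing to do if the process is already gone.
    if ! kill -0 "$pid" 2>/dev/null; then
        return 0
    fi

    # Polite request first; ignore the error if the process just exited.
    kill -TERM "$pid" 2>/dev/null || true

    # Poll once per second until the process exits or the timeout elapses.
    while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$timeout_secs" ]; do
        sleep 1
        waited=$((waited + 1))
    done

    # Still alive after the timeout: force termination.
    if kill -0 "$pid" 2>/dev/null; then
        echo "Process $pid did not exit after ${timeout_secs}s, sending SIGKILL" >&2
        kill -KILL "$pid" 2>/dev/null || true
    fi
    return 0
}
```

A caller would pass the PID read from its PID file plus a timeout, for example `stop_with_fallback "$pid" 10`. SIGKILL cannot be caught or ignored, so a process stuck in its own shutdown path, as reported in this issue, would still be terminated.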

Description

Hi Flink team!

We've attempted to upgrade our Flink 1.9 cluster to 1.10, but are experiencing reproducible instability on shutdown. Specifically, it appears that the `kill` issued in the `stop` case of flink-daemon.sh is causing the task executor process to hang permanently. The process seems to be hanging in `org.apache.flink.runtime.util.JvmShutdownSafeguard$DelayedTerminator.run`, inside a `Thread.sleep()` call, which strikes me as bizarre behavior. Also note that every thread in the process is BLOCKED on a `pthread_cond_wait` call. Is this an OS-level issue? Banging my head on a wall here. See attached stack traces for details.

Attachments

- flink_jstack.log (244 kB), 29/Apr/20 20:17, Hunter Herman
- flink_mixed_jstack.log (155 kB), 29/Apr/20 20:17, Hunter Herman

Issue Links

is related to

FLINK-16510 Task manager safeguard shutdown may not be reliable (Closed)

links to

GitHub Pull Request #14062

GitHub Pull Request #14128

People

Assignee: Robert Metzger
Reporter: Hunter Herman
Votes: 0
Watchers: 6

Dates

Created: 29/Apr/20 20:17
Updated: 20/Nov/20 12:04
Resolved: 20/Nov/20 12:04