Details

    Description

      We're experiencing a fatal crash in TaskTest:
      https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=45440&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=24c3384f-1bcb-57b3-224f-51bf973bbee8&l=8334

      [...]
      Jan 31 01:03:12 [ERROR] Process Exit Code: 239
      Jan 31 01:03:12 [ERROR] Crashed tests:
      Jan 31 01:03:12 [ERROR] org.apache.flink.runtime.taskmanager.TaskTest
      Jan 31 01:03:12 [ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.java:748)
      Jan 31 01:03:12 [ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.access$700(ForkStarter.java:121)
      Jan 31 01:03:12 [ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter$1.call(ForkStarter.java:393)
      Jan 31 01:03:12 [ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter$1.call(ForkStarter.java:370)
      Jan 31 01:03:12 [ERROR] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      Jan 31 01:03:12 [ERROR] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      Jan 31 01:03:12 [ERROR] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      Jan 31 01:03:12 [ERROR] at java.lang.Thread.run(Thread.java:748)
      Jan 31 01:03:12 [ERROR] -> [Help 1]
      Jan 31 01:03:12 [ERROR] 
      Jan 31 01:03:12 [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
      Jan 31 01:03:12 [ERROR] Re-run Maven using the -X switch to enable full debug logging.
      Jan 31 01:03:12 [ERROR] 
      Jan 31 01:03:12 [ERROR] For more information about the errors and possible solutions, please read the following articles:
      Jan 31 01:03:12 [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
      Jan 31 01:03:12 [ERROR] 
      Jan 31 01:03:12 [ERROR] After correcting the problems, you can resume the build with the command
      Jan 31 01:03:12 [ERROR]   mvn <goals> -rf :flink-runtime
      

        Activity

          mapohl Matthias Pohl added a comment -

          An AssertionError is reported that should be independent of this build failure. I created FLINK-30852 to cover that issue.

          mapohl Matthias Pohl added a comment -

          TaskTest.testInterruptibleSharedLockInInvokeAndCancel caused the failure:

          00:59:02,291 [Cancellation Watchdog for Test Task (1/1)#0 (003bbd51a0b61b0ff2925c31e749f53e_00000000000000000000000000000000_0_0).] ERROR org.apache.flink.util.FatalExitExceptionHandler              [] - FATAL: Thread 'Cancellation Watchdog for Test Task (1/1)#0 (003bbd51a0b61b0ff2925c31e749f53e_00000000000000000000000000000000_0_0).' produced an uncaught exception. Stopping the process...
          org.apache.flink.util.FlinkRuntimeException: Error in Task Cancellation Watch Dog
                  at org.apache.flink.runtime.taskmanager.Task$TaskCancelerWatchDog.run(Task.java:1801) ~[classes/:?]
                  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]
          Caused by: java.lang.RuntimeException: Unexpected FatalError notification
                  at org.apache.flink.runtime.taskmanager.TaskTest$ProhibitFatalErrorTaskManagerActions.notifyFatalError(TaskTest.java:1278) ~[test-classes/:?]
                  at org.apache.flink.runtime.taskmanager.Task$TaskCancelerWatchDog.run(Task.java:1798) ~[classes/:?]
                  ... 1 more
          

          The TaskCancelerWatchDog triggers the System.exit when the executor thread is still alive (see Task:1781).
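
          For illustration, a minimal, hedged sketch of that watchdog pattern: wait a bounded time for the task's executor thread to terminate, and escalate to a fatal-error notifier if it is still alive afterwards. The class name, the notifier interface, and the timeout handling here are simplified assumptions, not Flink's actual TaskCancelerWatchDog.

          // Sketch only: simplified stand-in for the watchdog described above.
          final class CancellationWatchdogSketch implements Runnable {

              // Assumed stand-in for the fatal-error path that ends in System.exit.
              interface FatalErrorNotifier {
                  void notifyFatalError(String message, Throwable cause);
              }

              private final Thread executorThread;
              private final long timeoutMillis;
              private final FatalErrorNotifier notifier;

              CancellationWatchdogSketch(
                      Thread executorThread, long timeoutMillis, FatalErrorNotifier notifier) {
                  this.executorThread = executorThread;
                  this.timeoutMillis = timeoutMillis;
                  this.notifier = notifier;
              }

              @Override
              public void run() {
                  final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
                  try {
                      // Wait (bounded) for the executor thread to terminate.
                      long remainingMillis;
                      while (executorThread.isAlive()
                              && (remainingMillis = (deadline - System.nanoTime()) / 1_000_000L) > 0) {
                          executorThread.join(remainingMillis);
                      }
                  } catch (InterruptedException e) {
                      Thread.currentThread().interrupt();
                      return;
                  }
                  // Thread outlived the timeout: escalate. In Flink this path is
                  // what ultimately stops the whole process.
                  if (executorThread.isAlive()) {
                      notifier.notifyFatalError(
                              "Task thread did not terminate within " + timeoutMillis + " ms", null);
                  }
              }
          }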

          Piotr Nowicki Anton Kalashnikov, may one of you have a look at this?

          piotr.nowicki Piotr Nowicki added a comment -

          Matthias Pohl I believe I'm not the person you wanted to notify here.

          mapohl Matthias Pohl added a comment - edited

          Ah, typo! 🤦‍♂️ Sorry for the spam, Piotr Nowicki. I meant Piotr Nowojski.


          akalashnikov Anton Kalashnikov added a comment -

          I don't see a problem here. I just see that the thread takes longer than 50 ms to finish, but it isn't stuck and is making progress, which is good. Normally it should take about 1 ms, but 50 ms is also nothing extraordinary on an overloaded machine. The only suspect is the synchronization inside the finishing loop, which can delay the actual finish, but I have no evidence that it is the reason.
          I will try to check more ideas, but if nothing works I will just increase the waiting interval for this test. (I actually think that 50 ms is low anyway.)
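
          For illustration, a hypothetical, self-contained demo of that timing argument: a worker that normally finishes almost immediately can be delayed past a tight 50 ms bound on an overloaded machine while still making progress. The worker, the simulated 80 ms delay, and the 5 s fallback bound are assumptions for the sketch, not the actual test code.

          import java.util.concurrent.CountDownLatch;
          import java.util.concurrent.TimeUnit;

          public class TightTimeoutDemo {
              public static void main(String[] args) throws InterruptedException {
                  CountDownLatch done = new CountDownLatch(1);
                  Thread worker = new Thread(() -> {
                      try {
                          Thread.sleep(80); // simulated scheduling delay under load
                      } catch (InterruptedException ignored) {
                          Thread.currentThread().interrupt();
                      } finally {
                          done.countDown();
                      }
                  });
                  worker.start();

                  // A tight 50 ms bound misses the still-progressing worker...
                  System.out.println("done within 50 ms? " + done.await(50, TimeUnit.MILLISECONDS));
                  // ...while a generous bound succeeds without risking an unbounded hang.
                  System.out.println("done within 5 s?   " + done.await(5, TimeUnit.SECONDS));
              }
          }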

          mapohl Matthias Pohl added a comment -

          Thanks for sharing your thoughts. It sounds reasonable. I'm going to lower the priority of this one to Major.

          mapohl Matthias Pohl added a comment - https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=47318&view=logs&j=0e7be18f-84f2-53f0-a32d-4a5e4a174679&t=7c1d86e3-35bd-5fd5-3b7c-30c126a78702&l=8351
          mapohl Matthias Pohl added a comment - https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=47748&view=logs&j=a549b384-c55a-52c0-c451-00e0477ab6db&t=eef5922c-08d9-5ba3-7299-8393476594e7&l=8807
          mapohl Matthias Pohl added a comment - https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=47750&view=logs&j=0e7be18f-84f2-53f0-a32d-4a5e4a174679&t=7c1d86e3-35bd-5fd5-3b7c-30c126a78702&l=8385

          akalash Anton Kalashnikov added a comment -

          In conclusion, I haven't found any deadlocks or other suspicious things. I was able to reproduce it locally; it seems it indeed just runs pretty slowly on an overloaded machine. So I just increased the timeout. We will see how it goes.

          merged to master: 6e95bfaf

          Sergey Nuyanzin Sergey Nuyanzin added a comment - Merged to 1.16 as 141b47a80092134f95dfdf6b3d0e7051d4fec6bb

          People

            Assignee: akalashnikov Anton Kalashnikov
            Reporter: mapohl Matthias Pohl
