[FLINK-30878] KubernetesHighAvailabilityRecoverFromSavepointITCase fails due to a deadlock - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.17.0, 1.15.4, 1.16.2
Fix Version/s: 1.17.0, 1.15.4, 1.16.2
Component/s: Deployment / Kubernetes, Runtime / Coordination
Labels:
- pull-request-available
- test-stability

Description

We're seeing a test failure in KubernetesHighAvailabilityRecoverFromSavepointITCase due to a deadlock:

2023-02-01T18:53:35.5540322Z "ForkJoinPool-1-worker-1" #14 daemon prio=5 os_prio=0 tid=0x00007f68ecb18000 nid=0x43dd1 waiting on condition [0x00007f68c1711000]
2023-02-01T18:53:35.5540900Z    java.lang.Thread.State: TIMED_WAITING (parking)
2023-02-01T18:53:35.5541272Z 	at sun.misc.Unsafe.park(Native Method)
2023-02-01T18:53:35.5541932Z 	- parking to wait for  <0x00000000d14d7b60> (a java.util.concurrent.CompletableFuture$Signaller)
2023-02-01T18:53:35.5542496Z 	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
2023-02-01T18:53:35.5543088Z 	at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1709)
2023-02-01T18:53:35.5543672Z 	at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3313)
2023-02-01T18:53:35.5544240Z 	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1788)
2023-02-01T18:53:35.5544801Z 	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
2023-02-01T18:53:35.5545632Z 	at org.apache.flink.kubernetes.highavailability.KubernetesHighAvailabilityRecoverFromSavepointITCase.testRecoverFromSavepoint(KubernetesHighAvailabilityRecoverFromSavepointITCase.java:113)
2023-02-01T18:53:35.5546409Z 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=45565&view=logs&j=bea52777-eaf8-5663-8482-18fbc3630e81&t=b2642e3a-5b86-574d-4c8a-f7e2842bfb14&l=61916
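For readers unfamiliar with this thread state: the dump above shows the test thread parked inside a timed `CompletableFuture.get`. The following is a minimal, standalone sketch of that waiting pattern, not the Flink test code. The class name `TimedGetSketch` and the helper `timesOut` are hypothetical and exist only for illustration. It assumes the awaited future is never completed, which is what a caller would observe if the component expected to complete it is stuck.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration class; not part of Flink.
class TimedGetSketch {

    /** Returns true if the timed get gave up because the future was not completed in time. */
    static boolean timesOut(CompletableFuture<String> future, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        try {
            // While this call waits, a thread dump reports the calling thread
            // as TIMED_WAITING (parking) below CompletableFuture.timedGet.
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return false;
        } catch (TimeoutException e) {
            // Nobody completed the future before the timeout elapsed.
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumption: this future stands in for a result that never arrives.
        CompletableFuture<String> neverCompleted = new CompletableFuture<>();
        System.out.println("never completed, timed out: " + timesOut(neverCompleted, 200));

        // For contrast, an already completed future returns immediately.
        CompletableFuture<String> done = CompletableFuture.completedFuture("done");
        System.out.println("already completed, timed out: " + timesOut(done, 200));
    }
}
```

A timed wait like this only bounds how long the caller blocks; it does not say why the future was never completed, so the other threads in the dump still need to be inspected.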

The build failure happened on 1.16. I'm adding 1.17 and 1.15 as fixVersions as well because the failure might be caused by recent changes that were introduced with FLINK-30462 and/or FLINK-30474.

Issue Links

is caused by:
- FLINK-30474 DefaultMultipleComponentLeaderElectionService triggers HA backend change even if it's not the leader (Closed)

links to:
- GitHub Pull Request #21830
- GitHub Pull Request #21831
- GitHub Pull Request #21832

People

Assignee: Matthias Pohl

Reporter: Matthias Pohl

Dates

Created: 02/Feb/23 08:30

Updated: 02/Feb/23 11:03

Resolved: 02/Feb/23 11:03