[FLINK-24919] UnalignedCheckpointITCase hangs on Azure - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.15.0
Fix Version/s: 1.13.6, 1.14.3, 1.15.0
Component/s: Runtime / Checkpointing
Labels:
- pull-request-available

Description

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=26304&view=logs&j=a57e0635-3fad-5b08-57c7-a4142d7d6fa9&t=2ef0effc-1da1-50e5-c2bd-aab434b1c5b7&l=13067

Nov 10 16:13:03 Starting org.apache.flink.test.checkpointing.UnalignedCheckpointITCase#execute[pipeline with mixed channels, p = 20, timeout = 0, buffersPerChannel = 1].

The log shows that this test case hangs. This appears to be a new issue, different from the one originally reported in this ticket. The stack trace suggests that something is wrong in the checkpoint coordinator: the following thread holds the lock on 0x0000000087db4fb8:

2021-11-10T17:14:21.0899474Z Nov 10 17:14:21 "jobmanager-io-thread-2" #12984 daemon prio=5 os_prio=0 tid=0x00007f12e000b800 nid=0x3fb6 runnable [0x00007f0fcd6d4000]
2021-11-10T17:14:21.0899924Z Nov 10 17:14:21    java.lang.Thread.State: RUNNABLE
2021-11-10T17:14:21.0900300Z Nov 10 17:14:21 	at java.util.HashMap$TreeNode.balanceDeletion(HashMap.java:2338)
2021-11-10T17:14:21.0900745Z Nov 10 17:14:21 	at java.util.HashMap$TreeNode.removeTreeNode(HashMap.java:2112)
2021-11-10T17:14:21.0901146Z Nov 10 17:14:21 	at java.util.HashMap.removeNode(HashMap.java:840)
2021-11-10T17:14:21.0901577Z Nov 10 17:14:21 	at java.util.LinkedHashMap.afterNodeInsertion(LinkedHashMap.java:301)
2021-11-10T17:14:21.0902002Z Nov 10 17:14:21 	at java.util.HashMap.putVal(HashMap.java:664)
2021-11-10T17:14:21.0902531Z Nov 10 17:14:21 	at java.util.HashMap.putMapEntries(HashMap.java:515)
2021-11-10T17:14:21.0902931Z Nov 10 17:14:21 	at java.util.HashMap.putAll(HashMap.java:785)
2021-11-10T17:14:21.0903429Z Nov 10 17:14:21 	at org.apache.flink.runtime.checkpoint.ExecutionAttemptMappingProvider.getVertex(ExecutionAttemptMappingProvider.java:60)
2021-11-10T17:14:21.0904060Z Nov 10 17:14:21 	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.reportStats(CheckpointCoordinator.java:1867)
2021-11-10T17:14:21.0904686Z Nov 10 17:14:21 	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1152)
2021-11-10T17:14:21.0905372Z Nov 10 17:14:21 	- locked <0x0000000087db4fb8> (a java.lang.Object)
2021-11-10T17:14:21.0905895Z Nov 10 17:14:21 	at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89)
2021-11-10T17:14:21.0906493Z Nov 10 17:14:21 	at org.apache.flink.runtime.scheduler.ExecutionGraphHandler$$Lambda$1368/705813936.accept(Unknown Source)
2021-11-10T17:14:21.0907086Z Nov 10 17:14:21 	at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119)
2021-11-10T17:14:21.0907698Z Nov 10 17:14:21 	at org.apache.flink.runtime.scheduler.ExecutionGraphHandler$$Lambda$1369/1447418658.run(Unknown Source)
2021-11-10T17:14:21.0908210Z Nov 10 17:14:21 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
2021-11-10T17:14:21.0908735Z Nov 10 17:14:21 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
2021-11-10T17:14:21.0909333Z Nov 10 17:14:21 	at java.lang.Thread.run(Thread.java:748)

However, another thread is waiting for that lock. I am not familiar with this logic and am not sure whether this state is expected. Could someone who knows this code take a look?

BTW, concurrent access to a HashMap can cause an infinite loop. The stack trace shows multiple threads accessing a HashMap, though I am not sure whether they are accessing the same instance.
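
To illustrate the concern above: the thread dump shows `HashMap.putAll` followed by `LinkedHashMap.afterNodeInsertion`, the hook that evicts the eldest entry of a size-bounded `LinkedHashMap`. Such a map is not thread-safe, so every read and write has to go through a common lock. The following is only a minimal sketch of that idea; the class name `BoundedCacheSketch` and its methods are hypothetical, and it is not taken from Flink or from the linked pull requests.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: a size-bounded LinkedHashMap guarded by a single lock. */
final class BoundedCacheSketch<K, V> {
    private final Object lock = new Object();
    private final Map<K, V> cache;

    BoundedCacheSketch(final int maxEntries) {
        // removeEldestEntry is consulted from afterNodeInsertion on each insert.
        this.cache = new LinkedHashMap<K, V>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    V get(K key) {
        synchronized (lock) {
            return cache.get(key);
        }
    }

    void putAll(Map<K, V> entries) {
        // Insertion may trigger eviction, so it must not race with other access.
        synchronized (lock) {
            cache.putAll(entries);
        }
    }

    int size() {
        synchronized (lock) {
            return cache.size();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        final BoundedCacheSketch<Integer, String> cache = new BoundedCacheSketch<>(16);
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int t = 0; t < 4; t++) {
            final int base = t * 1000;
            // Several threads insert concurrently; the lock serializes them.
            pool.submit(() -> {
                for (int i = 0; i < 1000; i++) {
                    cache.putAll(Collections.singletonMap(base + i, "v" + i));
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("size = " + cache.size());
    }
}
```

Without the `synchronized` blocks, concurrent insertions can corrupt the map's internal tree or linked structure, which is one way a thread can end up spinning inside `HashMap` internals as in the dump above. Whether that is what happened in this build would still need to be confirmed against the actual code.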

Issue Links

split from

FLINK-23466 UnalignedCheckpointITCase hangs on Azure

Closed

links to

GitHub Pull Request #17946

GitHub Pull Request #17992

GitHub Pull Request #17993

People

Assignee: Anton Kalashnikov

Reporter: Piotr Nowojski

Votes: 0

Watchers: 3

Dates

Created: 16/Nov/21 08:36

Updated: 15/Dec/21 01:44

Resolved: 02/Dec/21 16:02