[FLINK-20138] Flink Job can not recover due to timeout of requiring slots when flink jobmanager restarted - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Abandoned
Affects Version/s: None
Fix Version/s: None
Component/s: Deployment / YARN, Table SQL / Runtime
Labels:
- stale-major
Environment:

Flink: 1.9.2
Hadoop: 2.7.2
JDK: 1.8

Description

Our Flink jobs run on YARN in per-job mode. We stopped some NodeManager machines, and the ApplicationMasters (AMs) on those machines were restarted on other NodeManagers. We found that some jobs could not recover because their slot requests timed out.

*SlotPoolImpl never connected to the ResourceManager*
```
2020-11-09 16:31:31,794 INFO flink-akka.actor.default-dispatcher-16 (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369) - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]
```

*1. The attached jobmanager log contains no entry showing YarnResourceManager requesting containers.
2. The ZooKeeper node is also shown in the attachments.*
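
For anyone triaging a similar log, the two observations above can be checked with a short script. The following is only a sketch: the stash message text is copied from the excerpt above, while the function name `summarize` and the assumption that any line written by the YARN resource manager contains the string `YarnResourceManager` are illustrative and not confirmed by this report.

```python
import re
from collections import Counter

# Message text copied from the log excerpt in this report.
STASH_MSG = "Cannot serve slot request, no ResourceManager connected"
# Assumption: lines logged by the YARN resource manager mention this class name.
YARN_RM_MARKER = "YarnResourceManager"
# Slot request ids appear as 32 hex characters inside braces.
SLOT_ID = re.compile(r"\{([0-9a-f]{32})\}")


def summarize(lines):
    """Return (stashed request count, YarnResourceManager line count, slot ids)."""
    counts = Counter()
    slot_ids = set()
    for line in lines:
        if STASH_MSG in line:
            counts["stashed"] += 1
            # Only collect ids from the stash messages themselves.
            slot_ids.update(SLOT_ID.findall(line))
        if YARN_RM_MARKER in line:
            counts["yarn_rm"] += 1
    return counts["stashed"], counts["yarn_rm"], sorted(slot_ids)
```

Passing an open file handle for the attached jobmanager.log to `summarize` would give the three values. A non-zero stash count together with a zero YarnResourceManager count would match the behavior described here.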

Attachments

- 2820F7EE-85F9-441D-95D5-8163FB6267DF.png (119 kB), added 13/Nov/20 03:30 by wgcn
- jobmanager.log (4.74 MB), added 13/Nov/20 03:31 by wgcn
- zk_resource_address_info.png (132 kB), added 14/Nov/20 08:08 by wgcn

People

Assignee: Unassigned

Reporter: wgcn

Votes: 0

Watchers: 3

Dates

Created: 13/Nov/20 03:30

Updated: 22/Apr/21 16:29

Resolved: 22/Apr/21 16:29