Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Versions: 1.18.0, kubernetes-operator-1.6.1
- Fix Versions: None
- Components: None
Description
We had a Flink spec in which the TaskManager (TM) CPU was set to 0.5; we then upgraded it to 4.0. Afterwards we observed the JobManager requesting TMs with both 0.5 CPU and 4.0 CPU. Most of the 0.5-CPU TMs were released soon after, but one TM with 0.5 CPU remained and caused lag in the job.
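For reference, the change amounted to bumping taskManager.resource.cpu in the FlinkDeployment custom resource. A minimal sketch of the relevant excerpt, assuming the standard FlinkDeployment CRD of the Flink Kubernetes operator (the deployment name and memory value below are placeholders, not taken from our actual spec):

apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: write-proxy          # placeholder name
spec:
  taskManager:
    resource:
      memory: "8g"           # placeholder value
      cpu: 4.0               # upgraded from 0.5

After this change was applied, the JobManager produced the mixed worker requests shown in the logs below.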
Logs showing the mixed TM requests:
2024-02-03 10:10:41,414 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requested worker octopus-16-323-octopus-engine-write-proxy-taskmanager-3-244 with resource spec WorkerResourceSpec {cpuCores=4.0, taskHeapSize=5.637gb (6053219520 bytes), taskOffHeapSize=1024.000mb (1073741824 bytes), networkMemSize=64.000mb (67108864 bytes), managedMemSize=0 bytes, numSlots=4}.
2024-02-03 10:10:44,844 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requesting new worker with resource spec WorkerResourceSpec {cpuCores=0.5, taskHeapSize=1.137gb (1221381320 bytes), taskOffHeapSize=1024.000mb (1073741824 bytes), networkMemSize=64.000mb (67108864 bytes), managedMemSize=0 bytes, numSlots=4}, current pending count: 1.
2024-02-03 10:10:44,920 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requesting new worker with resource spec WorkerResourceSpec {cpuCores=0.5, taskHeapSize=1.137gb (1221381320 bytes), taskOffHeapSize=1024.000mb (1073741824 bytes), networkMemSize=64.000mb (67108864 bytes), managedMemSize=0 bytes, numSlots=4}, current pending count: 2.
Name of the leftover TM with the old 0.5-CPU spec: octopus-16-323-octopus-engine-write-proxy-taskmanager-3-326.
Relevant logs are attached.