Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Versions: 1.18.0, kubernetes-operator-1.6.1
- Fix Versions: None
- Components: None
Description
We had a Flink spec in which the TaskManager (TM) CPU was set to 0.5; we then upgraded it to 4.0. Afterwards we observed the JobManager requesting TMs with both 0.5 CPU and 4.0 CPU. Most of the 0.5-CPU TMs were released soon after, but one TM with 0.5 CPU remained and caused lag in the job.
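For reference, the change amounted to bumping taskManager.resource.cpu in the FlinkDeployment custom resource. A minimal sketch of the relevant excerpt, assuming the standard FlinkDeployment CRD of the Flink Kubernetes operator (the deployment name and memory value below are placeholders, not taken from our actual spec):

apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: write-proxy          # placeholder name
spec:
  taskManager:
    resource:
      memory: "8g"           # placeholder value
      cpu: 4.0               # upgraded from 0.5

After this change was applied, the JobManager produced the mixed worker requests shown in the logs below.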
Logs showing the mixed TM requests:
2024-02-03 10:10:41,414 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requested worker octopus-16-323-octopus-engine-write-proxy-taskmanager-3-244 with resource spec WorkerResourceSpec {cpuCores=4.0, taskHeapSize=5.637gb (6053219520 bytes), taskOffHeapSize=1024.000mb (1073741824 bytes), networkMemSize=64.000mb (67108864 bytes), managedMemSize=0 bytes, numSlots=4}.
2024-02-03 10:10:44,844 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requesting new worker with resource spec WorkerResourceSpec {cpuCores=0.5, taskHeapSize=1.137gb (1221381320 bytes), taskOffHeapSize=1024.000mb (1073741824 bytes), networkMemSize=64.000mb (67108864 bytes), managedMemSize=0 bytes, numSlots=4}, current pending count: 1.
2024-02-03 10:10:44,920 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requesting new worker with resource spec WorkerResourceSpec {cpuCores=0.5, taskHeapSize=1.137gb (1221381320 bytes), taskOffHeapSize=1024.000mb (1073741824 bytes), networkMemSize=64.000mb (67108864 bytes), managedMemSize=0 bytes, numSlots=4}, current pending count: 2.
Name of the leftover TM with the old 0.5-CPU spec: octopus-16-323-octopus-engine-write-proxy-taskmanager-3-326.
Relevant logs are attached.