Details
-
Bug
-
Status: Closed
-
Critical
-
Resolution: Not A Problem
-
1.11.0
-
None
Description
Saw this failure in jobmanager startup. I know the exception said that taskmanager.memory.process.size is misconfigured, which is a bug in our end. The bug wasn't discovered because taskmanager.memory.process.size was not required by jobmanager before 1.11.
But I am wondering why is this required by jobmanager for session cluster mode. When taskmanager registering with jobmanager, it reports the resources (like CPU, memory etc.). BTW, we set it properly at taskmanager side in `flink-conf.yaml`.
2020-06-17 18:06:25,079 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [main] - Could not start cluster entrypoint TitusSessionClusterEntrypoint. org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to initialize the cluster entrypoint TitusSessionClusterEntrypoint. at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:187) at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:516) at com.netflix.spaas.runtime.TitusSessionClusterEntrypoint.main(TitusSessionClusterEntrypoint.java:103) Caused by: org.apache.flink.util.FlinkException: Could not create the DispatcherResourceManagerComponent. at org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory.create(DefaultDispatcherResourceManagerComponentFactory.java:255) at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:216) at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:169) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:168) ... 2 more Caused by: org.apache.flink.configuration.IllegalConfigurationException: Cannot read memory size from config option 'taskmanager.memory.process.size'. at org.apache.flink.runtime.util.config.memory.ProcessMemoryUtils.getMemorySizeFromConfig(ProcessMemoryUtils.java:234) at org.apache.flink.runtime.util.config.memory.ProcessMemoryUtils.deriveProcessSpecWithTotalProcessMemory(ProcessMemoryUtils.java:100) at org.apache.flink.runtime.util.config.memory.ProcessMemoryUtils.memoryProcessSpecFromConfig(ProcessMemoryUtils.java:79) at org.apache.flink.runtime.clusterframework.TaskExecutorProcessUtils.processSpecFromConfig(TaskExecutorProcessUtils.java:109) at org.apache.flink.runtime.clusterframework.TaskExecutorProcessSpecBuilder.build(TaskExecutorProcessSpecBuilder.java:58) at org.apache.flink.runtime.resourcemanager.WorkerResourceSpecFactory.workerResourceSpecFromConfigAndCpu(WorkerResourceSpecFactory.java:37) at com.netflix.spaas.runtime.resourcemanager.TitusWorkerResourceSpecFactory.createDefaultWorkerResourceSpec(TitusWorkerResourceSpecFactory.java:17) at org.apache.flink.runtime.resourcemanager.ResourceManagerRuntimeServicesConfiguration.fromConfiguration(ResourceManagerRuntimeServicesConfiguration.java:67) at com.netflix.spaas.runtime.resourcemanager.TitusResourceManagerFactory.createResourceManager(TitusResourceManagerFactory.java:53) at org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory.create(DefaultDispatcherResourceManagerComponentFactory.java:167) ... 9 more Caused by: java.lang.IllegalArgumentException: Could not parse value '7500}' for key 'taskmanager.memory.process.size'. at org.apache.flink.configuration.Configuration.getOptional(Configuration.java:753) at org.apache.flink.configuration.Configuration.get(Configuration.java:738) at org.apache.flink.runtime.util.config.memory.ProcessMemoryUtils.getMemorySizeFromConfig(ProcessMemoryUtils.java:232) ... 18 more Caused by: java.lang.IllegalArgumentException: Memory size unit '}' does not match any of the recognized units: (b | bytes) / (k | kb | kibibytes) / (m | mb | mebibytes) / (g | gb | gibibytes) / (t | tb | tebibytes) at org.apache.flink.configuration.MemorySize.parseUnit(MemorySize.java:331) at org.apache.flink.configuration.MemorySize.parseBytes(MemorySize.java:306) at org.apache.flink.configuration.MemorySize.parse(MemorySize.java:247) at org.apache.flink.configuration.Configuration.convertToMemorySize(Configuration.java:951) at org.apache.flink.configuration.Configuration.convertValue(Configuration.java:885) at org.apache.flink.configuration.Configuration.lambda$getOptional$2(Configuration.java:750) at java.util.Optional.map(Optional.java:215) at org.apache.flink.configuration.Configuration.getOptional(Configuration.java:750) ... 20 more
We extend from WorkerResourceSpecFactory similar to KubernetesWorkerResourceSpecFactory.
public class TitusWorkerResourceSpecFactory extends WorkerResourceSpecFactory { public static final TitusWorkerResourceSpecFactory INSTANCE = new TitusWorkerResourceSpecFactory(); @Override public WorkerResourceSpec createDefaultWorkerResourceSpec(Configuration configuration) { return workerResourceSpecFromConfigAndCpu(configuration, getDefaultCpus(configuration)); } @VisibleForTesting static CPUResource getDefaultCpus(Configuration configuration) { double fallback = Double.valueOf(System.getenv("TITUS_NUM_CPU")); return TaskExecutorProcessUtils.getCpuCoresWithFallback(configuration, fallback); } }