Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-18350

[1.11.0] jobmanager requires taskmanager.memory.process.size config

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Not A Problem
    • Affects Version/s: 1.11.0
    • Fix Version/s: 1.11.0
    • Labels:
      None

      Description

       

      Saw this failure in jobmanager startup. I know the exception said that taskmanager.memory.process.size is misconfigured, which is a bug in our end. The bug wasn't discovered because taskmanager.memory.process.size was not required by jobmanager before 1.11.

      But I am wondering why is this required by jobmanager for session cluster mode. When taskmanager registering with jobmanager, it reports the resources (like CPU, memory etc.).  BTW, we set it properly at taskmanager side in `flink-conf.yaml`.

      2020-06-17 18:06:25,079 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [main]  - Could not start cluster entrypoint TitusSessionClusterEntrypoint.
      org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to initialize the cluster entrypoint TitusSessionClusterEntrypoint.
      	at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:187)
      	at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:516)
      	at com.netflix.spaas.runtime.TitusSessionClusterEntrypoint.main(TitusSessionClusterEntrypoint.java:103)
      Caused by: org.apache.flink.util.FlinkException: Could not create the DispatcherResourceManagerComponent.
      	at org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory.create(DefaultDispatcherResourceManagerComponentFactory.java:255)
      	at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:216)
      	at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:169)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:422)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
      	at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
      	at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:168)
      	... 2 more
      Caused by: org.apache.flink.configuration.IllegalConfigurationException: Cannot read memory size from config option 'taskmanager.memory.process.size'.
      	at org.apache.flink.runtime.util.config.memory.ProcessMemoryUtils.getMemorySizeFromConfig(ProcessMemoryUtils.java:234)
      	at org.apache.flink.runtime.util.config.memory.ProcessMemoryUtils.deriveProcessSpecWithTotalProcessMemory(ProcessMemoryUtils.java:100)
      	at org.apache.flink.runtime.util.config.memory.ProcessMemoryUtils.memoryProcessSpecFromConfig(ProcessMemoryUtils.java:79)
      	at org.apache.flink.runtime.clusterframework.TaskExecutorProcessUtils.processSpecFromConfig(TaskExecutorProcessUtils.java:109)
      	at org.apache.flink.runtime.clusterframework.TaskExecutorProcessSpecBuilder.build(TaskExecutorProcessSpecBuilder.java:58)
      	at org.apache.flink.runtime.resourcemanager.WorkerResourceSpecFactory.workerResourceSpecFromConfigAndCpu(WorkerResourceSpecFactory.java:37)
      	at com.netflix.spaas.runtime.resourcemanager.TitusWorkerResourceSpecFactory.createDefaultWorkerResourceSpec(TitusWorkerResourceSpecFactory.java:17)
      	at org.apache.flink.runtime.resourcemanager.ResourceManagerRuntimeServicesConfiguration.fromConfiguration(ResourceManagerRuntimeServicesConfiguration.java:67)
      	at com.netflix.spaas.runtime.resourcemanager.TitusResourceManagerFactory.createResourceManager(TitusResourceManagerFactory.java:53)
      	at org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory.create(DefaultDispatcherResourceManagerComponentFactory.java:167)
      	... 9 more
      Caused by: java.lang.IllegalArgumentException: Could not parse value '7500}' for key 'taskmanager.memory.process.size'.
      	at org.apache.flink.configuration.Configuration.getOptional(Configuration.java:753)
      	at org.apache.flink.configuration.Configuration.get(Configuration.java:738)
      	at org.apache.flink.runtime.util.config.memory.ProcessMemoryUtils.getMemorySizeFromConfig(ProcessMemoryUtils.java:232)
      	... 18 more
      Caused by: java.lang.IllegalArgumentException: Memory size unit '}' does not match any of the recognized units: (b | bytes) / (k | kb | kibibytes) / (m | mb | mebibytes) / (g | gb | gibibytes) / (t | tb | tebibytes)
      	at org.apache.flink.configuration.MemorySize.parseUnit(MemorySize.java:331)
      	at org.apache.flink.configuration.MemorySize.parseBytes(MemorySize.java:306)
      	at org.apache.flink.configuration.MemorySize.parse(MemorySize.java:247)
      	at org.apache.flink.configuration.Configuration.convertToMemorySize(Configuration.java:951)
      	at org.apache.flink.configuration.Configuration.convertValue(Configuration.java:885)
      	at org.apache.flink.configuration.Configuration.lambda$getOptional$2(Configuration.java:750)
      	at java.util.Optional.map(Optional.java:215)
      	at org.apache.flink.configuration.Configuration.getOptional(Configuration.java:750)
      	... 20 more
      

      We extend from WorkerResourceSpecFactory similar to KubernetesWorkerResourceSpecFactory.

      public class TitusWorkerResourceSpecFactory extends WorkerResourceSpecFactory {
      
        public static final TitusWorkerResourceSpecFactory INSTANCE =
            new TitusWorkerResourceSpecFactory();
      
        @Override
        public WorkerResourceSpec createDefaultWorkerResourceSpec(Configuration configuration) {
          return workerResourceSpecFromConfigAndCpu(configuration, getDefaultCpus(configuration));
        }
      
        @VisibleForTesting
        static CPUResource getDefaultCpus(Configuration configuration) {
          double fallback = Double.valueOf(System.getenv("TITUS_NUM_CPU"));
          return TaskExecutorProcessUtils.getCpuCoresWithFallback(configuration, fallback);
        }
      }
      

       

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              stevenz3wu Steven Zhen Wu
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: