Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-24962

The root cause of a failed job was hidden in certain cases

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Cannot Reproduce
    • 1.14.0
    • None
    • API / Python
    • None

    Description

      For the following job:

      from pyflink.common.typeinfo import Types
      from pyflink.datastream import StreamExecutionEnvironment, DataStream
      from pyflink.table import StreamTableEnvironment
      
      
      def state_access_demo():
          env = StreamExecutionEnvironment.get_execution_environment()
          t_env = StreamTableEnvironment.create(stream_execution_environment=env)
          sql = """CREATE TABLE kafka_source (
              id int,
              name string
          ) WITH (
              'connector' = 'kafka',
              'topic' = 'test',
              'properties.bootstrap.servers' = '***',
              'properties.group.id' = '***',
              'scan.startup.mode' = 'latest-offset',
              'format' = 'json',
              'json.fail-on-missing-field' = 'false',
              'json.ignore-parse-errors' = 'true'
          )"""
          t_env.execute_sql(sql)
          table = t_env.from_path("kafka_source")
          ds: DataStream = t_env.to_append_stream(table, type_info=Types.ROW([Types.INT(), Types.STRING()]))
          ds.map(lambda a: print(a))
          env.execute('state_access_demo')
      
      if __name__ == '__main__':
          state_access_demo()
      

      It failed with the following exception which doesn't contain any useful information (the root cause is that it should return a value in the map function):

      Caused by: java.lang.RuntimeException: Failed to create stage bundle factory! INFO:root:Initializing Python harness: D:\soft\anaconda\lib\site-packages\pyflink\fn_execution\beam\beam_boot.py --id=2-1 --provision_endpoint=localhost:57201
      INFO:root:Starting up Python harness in loopback mode.
      
          at org.apache.flink.streaming.api.runners.python.beam.BeamPythonFunctionRunner.createStageBundleFactory(BeamPythonFunctionRunner.java:566)
          at org.apache.flink.streaming.api.runners.python.beam.BeamPythonFunctionRunner.open(BeamPythonFunctionRunner.java:255)
          at org.apache.flink.streaming.api.operators.python.AbstractPythonFunctionOperator.open(AbstractPythonFunctionOperator.java:131)
          at org.apache.flink.streaming.api.operators.python.AbstractOneInputPythonFunctionOperator.open(AbstractOneInputPythonFunctionOperator.java:116)
          at org.apache.flink.streaming.api.operators.python.PythonProcessOperator.open(PythonProcessOperator.java:59)
          at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.initializeStateAndOpenOperators(RegularOperatorChain.java:110)
          at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:711)
          at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55)
          at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:687)
          at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:654)
          at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
          at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:927)
          at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766)
          at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575)
          at java.lang.Thread.run(Thread.java:748)
      Caused by: org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalStateException: Process died with exit code 0
          at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2050)
          at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.cache.LocalCache.get(LocalCache.java:3952)
          at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3974)
          at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4958)
          at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4964)
          at org.apache.beam.runners.fnexecution.control.DefaultJobBundleFactory$SimpleStageBundleFactory.<init>(DefaultJobBundleFactory.java:451)
          at org.apache.beam.runners.fnexecution.control.DefaultJobBundleFactory$SimpleStageBundleFactory.<init>(DefaultJobBundleFactory.java:436)
          at org.apache.beam.runners.fnexecution.control.DefaultJobBundleFactory.forStage(DefaultJobBundleFactory.java:303)
          at org.apache.flink.streaming.api.runners.python.beam.BeamPythonFunctionRunner.createStageBundleFactory(BeamPythonFunctionRunner.java:564)
          ... 14 more
      Caused by: java.lang.IllegalStateException: Process died with exit code 0
          at org.apache.beam.runners.fnexecution.environment.ProcessManager$RunningProcess.isAliveOrThrow(ProcessManager.java:75)
          at org.apache.beam.runners.fnexecution.environment.ProcessEnvironmentFactory.createEnvironment(ProcessEnvironmentFactory.java:112)
          at org.apache.beam.runners.fnexecution.control.DefaultJobBundleFactory$1.load(DefaultJobBundleFactory.java:252)
          at org.apache.beam.runners.fnexecution.control.DefaultJobBundleFactory$1.load(DefaultJobBundleFactory.java:231)
          at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3528)
          at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2277)
          at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2154)
          at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2044)
      

      It's very difficult for users to figure out why the job failed and we should improve this.

      Attachments

        Activity

          People

            Unassigned Unassigned
            dianfu Dian Fu
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: