Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-6875

Remote DataSet API job submission timing out

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 1.3.0
    • Fix Version/s: 1.3.1
    • Component/s: DataSet API
    • Labels:
      None

      Description

      When trying to submit a DataSet API job from a remote environment, Flink times out. This works well in 1.2.1 and seems to be broken in 1.3.0.

      The following program reproduces the issue:

      Example
      package com.test;
      
      import org.apache.flink.api.java.DataSet;
      import org.apache.flink.api.java.ExecutionEnvironment;
      
      import java.util.Date;
      
      public class FlinkRemoteIssue {
      
          public static void main(String[] args) throws Exception {
      
              String host = "192.168.1.235";
              int port = 6123;
              String[] jars = {
                      "c:\\tmp\\FlinkRemoteIssue-all-1.0-SNAPSHOT.jar"
              };
              ExecutionEnvironment env = ExecutionEnvironment.createRemoteEnvironment(host, port, jars);
      
              DataSet<String> pipe = env.fromElements("1");
              pipe.map( (oneString) -> {
                  System.err.println("Map executing: " + new Date());
                  return "Map result: " + new Date();
              }).writeAsText("/tmp/lixo-" + System.currentTimeMillis());
      
              env.execute("Flink Remote Issue");
          }
      }
      

      Result from running program (running inside IntelliJ):

      Submitting job with JobID: 9f96638f014a87783cecd54b61c55d9a. Waiting for job completion.
      Connected to JobManager at Actor[akka.tcp://flink@10.97.120.139:6123/user/jobmanager#1432447220] with leader session id 00000000-0000-0000-0000-000000000000.
      Exception in thread "main" org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Couldn't retrieve the JobExecutionResult from the JobManager.
      	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:478)
      	at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:105)
      	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:442)
      	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:429)
      	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:404)
      	at org.apache.flink.client.RemoteExecutor.executePlanWithJars(RemoteExecutor.java:211)
      	at org.apache.flink.client.RemoteExecutor.executePlan(RemoteExecutor.java:188)
      	at org.apache.flink.api.java.RemoteEnvironment.execute(RemoteEnvironment.java:172)
      	at com.test.FlinkRemoteIssue.main(FlinkRemoteIssue.java:25)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
      Caused by: org.apache.flink.runtime.client.JobExecutionException: Couldn't retrieve the JobExecutionResult from the JobManager.
      	at org.apache.flink.runtime.client.JobClient.awaitJobResult(JobClient.java:309)
      	at org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:396)
      	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:467)
      	... 13 more
      Caused by: org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException: Job submission to the JobManager timed out. You may increase 'akka.client.timeout' in case the JobManager needs more time to configure and confirm the job submission.
      	at org.apache.flink.runtime.client.JobSubmissionClientActor.handleCustomMessage(JobSubmissionClientActor.java:119)
      	at org.apache.flink.runtime.client.JobClientActor.handleMessage(JobClientActor.java:251)
      	at org.apache.flink.runtime.akka.FlinkUntypedActor.handleLeaderSessionID(FlinkUntypedActor.java:89)
      	at org.apache.flink.runtime.akka.FlinkUntypedActor.onReceive(FlinkUntypedActor.java:68)
      	at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)
      	at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
      	at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97)
      	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
      	at akka.actor.ActorCell.invoke(ActorCell.scala:487)
      	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
      	at akka.dispatch.Mailbox.run(Mailbox.scala:220)
      	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
      	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
      	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
      	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
      	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
      
      Process finished with exit code 1
      

      Message in JobManager log:

      2017-06-08 10:57:03,310 WARN  org.apache.flink.runtime.jobmanager.JobManager                - Discard message LeaderSessionMessage(00000000-0000-0000-0000-000000000000,SubmitJob(JobGraph(jobId: 4d414efd050a871863f3319a8c56781c),EXECUTION_RESULT_AND_STATE_CHANGES)) because the expected leader session ID None did not equal the received leader session ID Some(00000000-0000-0000-0000-000000000000).
      

        Activity

        Hide
        fassisrosa@hotmail.com Francisco Rosa added a comment -

        I believe this was a versioning issue. Version on server did not match client version. Closing.

        Show
        fassisrosa@hotmail.com Francisco Rosa added a comment - I believe this was a versioning issue. Version on server did not match client version. Closing.
        Hide
        hetang hetang added a comment -

        十月 26, 2017 7:51:15 下午 org.apache.flink.runtime.blob.BlobClient uploadJarFiles
        信息: Blob client connecting to akka.tcp://flink@hbase-1:50022/user/jobmanager
        十月 26, 2017 7:51:19 下午 org.apache.flink.runtime.client.JobSubmissionClientActor$1 call
        信息: Submit job to the job manager akka.tcp://flink@hbase-1:50022/user/jobmanager.
        十月 26, 2017 7:52:19 下午 org.apache.flink.runtime.client.JobClientActor terminate
        信息: Terminate JobClientActor.
        十月 26, 2017 7:52:19 下午 org.apache.flink.runtime.client.JobClientActor disconnectFromJobManager
        信息: Disconnect from JobManager Actor[akka.tcp://flink@hbase-1:50022/user/jobmanager#1597589120].
        十月 26, 2017 7:52:19 下午 akka.event.slf4j.Slf4jLogger$$anonfun$receive$1$$anonfun$applyOrElse$3 apply$mcV$sp
        信息: Shutting down remote daemon.
        十月 26, 2017 7:52:19 下午 akka.event.slf4j.Slf4jLogger$$anonfun$receive$1$$anonfun$applyOrElse$3 apply$mcV$sp
        信息: Remote daemon shut down; proceeding with flushing remote transports.
        十月 26, 2017 7:52:19 下午 akka.event.slf4j.Slf4jLogger$$anonfun$receive$1$$anonfun$applyOrElse$3 apply$mcV$sp
        信息: Remoting shut down.
        十月 26, 2017 7:52:19 下午 org.apache.beam.runners.flink.FlinkRunner run
        严重: Pipeline execution failed
        org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Couldn't retrieve the JobExecutionResult from the JobManager.
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:478)
        at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:105)
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:442)
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:429)
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:404)
        at org.apache.flink.client.RemoteExecutor.executePlanWithJars(RemoteExecutor.java:211)
        at org.apache.flink.client.RemoteExecutor.executePlan(RemoteExecutor.java:188)
        at org.apache.flink.api.java.RemoteEnvironment.execute(RemoteEnvironment.java:172)
        at org.apache.beam.runners.flink.FlinkPipelineExecutionEnvironment.executePipeline(FlinkPipelineExecutionEnvironment.java:114)
        at org.apache.beam.runners.flink.FlinkRunner.run(FlinkRunner.java:118)
        at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297)
        at org.apache.beam.sdk.Pipeline.run(Pipeline.java:283)
        at org.apache.beam.examples.WordCount.main(WordCount.java:184)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:293)
        at java.lang.Thread.run(Thread.java:748)
        Caused by: org.apache.flink.runtime.client.JobExecutionException: Couldn't retrieve the JobExecutionResult from the JobManager.
        at org.apache.flink.runtime.client.JobClient.awaitJobResult(JobClient.java:309)
        at org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:396)
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:467)
        ... 18 more
        Caused by: org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException: Job submission to the JobManager timed out. You may increase 'akka.client.timeout' in case the JobManager needs more time to configure and confirm the job submission.
        at org.apache.flink.runtime.client.JobSubmissionClientActor.handleCustomMessage(JobSubmissionClientActor.java:119)
        at org.apache.flink.runtime.client.JobClientActor.handleMessage(JobClientActor.java:251)
        at org.apache.flink.runtime.akka.FlinkUntypedActor.handleLeaderSessionID(FlinkUntypedActor.java:89)
        at org.apache.flink.runtime.akka.FlinkUntypedActor.onReceive(FlinkUntypedActor.java:68)
        at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
        at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
        at akka.actor.ActorCell.invoke(ActorCell.scala:487)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
        at akka.dispatch.Mailbox.run(Mailbox.scala:220)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

        Show
        hetang hetang added a comment - 十月 26, 2017 7:51:15 下午 org.apache.flink.runtime.blob.BlobClient uploadJarFiles 信息: Blob client connecting to akka.tcp://flink@hbase-1:50022/user/jobmanager 十月 26, 2017 7:51:19 下午 org.apache.flink.runtime.client.JobSubmissionClientActor$1 call 信息: Submit job to the job manager akka.tcp://flink@hbase-1:50022/user/jobmanager. 十月 26, 2017 7:52:19 下午 org.apache.flink.runtime.client.JobClientActor terminate 信息: Terminate JobClientActor. 十月 26, 2017 7:52:19 下午 org.apache.flink.runtime.client.JobClientActor disconnectFromJobManager 信息: Disconnect from JobManager Actor [akka.tcp://flink@hbase-1:50022/user/jobmanager#1597589120] . 十月 26, 2017 7:52:19 下午 akka.event.slf4j.Slf4jLogger$$anonfun$receive$1$$anonfun$applyOrElse$3 apply$mcV$sp 信息: Shutting down remote daemon. 十月 26, 2017 7:52:19 下午 akka.event.slf4j.Slf4jLogger$$anonfun$receive$1$$anonfun$applyOrElse$3 apply$mcV$sp 信息: Remote daemon shut down; proceeding with flushing remote transports. 十月 26, 2017 7:52:19 下午 akka.event.slf4j.Slf4jLogger$$anonfun$receive$1$$anonfun$applyOrElse$3 apply$mcV$sp 信息: Remoting shut down. 十月 26, 2017 7:52:19 下午 org.apache.beam.runners.flink.FlinkRunner run 严重: Pipeline execution failed org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Couldn't retrieve the JobExecutionResult from the JobManager. at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:478) at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:105) at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:442) at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:429) at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:404) at org.apache.flink.client.RemoteExecutor.executePlanWithJars(RemoteExecutor.java:211) at org.apache.flink.client.RemoteExecutor.executePlan(RemoteExecutor.java:188) at org.apache.flink.api.java.RemoteEnvironment.execute(RemoteEnvironment.java:172) at org.apache.beam.runners.flink.FlinkPipelineExecutionEnvironment.executePipeline(FlinkPipelineExecutionEnvironment.java:114) at org.apache.beam.runners.flink.FlinkRunner.run(FlinkRunner.java:118) at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297) at org.apache.beam.sdk.Pipeline.run(Pipeline.java:283) at org.apache.beam.examples.WordCount.main(WordCount.java:184) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:293) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.flink.runtime.client.JobExecutionException: Couldn't retrieve the JobExecutionResult from the JobManager. at org.apache.flink.runtime.client.JobClient.awaitJobResult(JobClient.java:309) at org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:396) at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:467) ... 18 more Caused by: org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException: Job submission to the JobManager timed out. You may increase 'akka.client.timeout' in case the JobManager needs more time to configure and confirm the job submission. at org.apache.flink.runtime.client.JobSubmissionClientActor.handleCustomMessage(JobSubmissionClientActor.java:119) at org.apache.flink.runtime.client.JobClientActor.handleMessage(JobClientActor.java:251) at org.apache.flink.runtime.akka.FlinkUntypedActor.handleLeaderSessionID(FlinkUntypedActor.java:89) at org.apache.flink.runtime.akka.FlinkUntypedActor.onReceive(FlinkUntypedActor.java:68) at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167) at akka.actor.Actor$class.aroundReceive(Actor.scala:467) at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

          People

          • Assignee:
            Unassigned
            Reporter:
            fassisrosa@hotmail.com Francisco Rosa
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development