Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-33406

Flink Job failed due to losing connection from ZK server

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 1.14.3
    • None
    • API / DataStream
    • None
    • Flink version: 1.14.3

      Zookeeper version: 3.4.10

    Description

      We are using Flink 1.14.3 and we faced an issue when losing connection from ZK server, the flink job connecting to the target ZK server will be failed directly. This case can be reproduced 100% when you kill the connected ZK server for simulating connection refused issue. Flink jobs connect to other running ZK server keep running as expected. The log output is:

      org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: java.util.concurrent.TimeoutException
              at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:372)        
                  at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222)        
                  at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:114)        
                  at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:812)        
                  at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:246)        
                  at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1054)        
                  at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)        
                  at java.security.AccessController.doPrivileged(Native Method)        
                  at javax.security.auth.Subject.doAs(Subject.java:422)        
                  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1731)        
                  at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)        
                  at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)
      Caused by: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException        
                  at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)        
                  at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)        
                  at org.apache.flink.client.program.StreamContextEnvironment.getJobExecutionResult(StreamContextEnvironment.java:123)        
                  at org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:80)        
                  at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1916)

       

      Attachments

        Activity

          People

            Unassigned Unassigned
            singularityd Deng Liwen
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated: