FLINK-9891 (project: Flink)

Flink cluster is not shut down in YARN mode when Flink client is stopped


Details

    Description

      We are using neither session mode nor detached mode. The command to run the Flink job on YARN is:

      <flink-1.5.1>/bin/flink run -m yarn-cluster -yn 1 -yqu flink -yjm 768 -ytm 2048 -j ./flink-quickstart-java-1.0-SNAPSHOT.jar -c org.test.WordCount
      

      Flink CLI logs:

      Setting HADOOP_CONF_DIR=/etc/hadoop/conf because no HADOOP_CONF_DIR was set.
      SLF4J: Class path contains multiple SLF4J bindings.
      SLF4J: Found binding in [jar:file:/opt/flink-streaming/flink-streaming-1.5.1-1.5.1-bin-hadoop27-scala_2.11-1531485329/lib/slf4j-log4j12-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
      SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.10-1/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
      SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
      SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
      2018-07-18 12:47:03,747 INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://hmaster-1.ipbl.rgcloud.net:8188/ws/v1/timeline/
      2018-07-18 12:47:04,222 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
      2018-07-18 12:47:04,222 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
      2018-07-18 12:47:04,248 WARN org.apache.flink.yarn.AbstractYarnClusterDescriptor - Neither the HADOOP_CONF_DIR nor the YARN_CONF_DIR environment variable is set. The Flink YARN Client needs one of these to be set to properly load the Hadoop configuration for accessing YARN.
      2018-07-18 12:47:04,409 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Cluster specification: ClusterSpecification{masterMemoryMB=768, taskManagerMemoryMB=2048, numberTaskManagers=1, slotsPerTaskManager=1}
      2018-07-18 12:47:04,783 WARN org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory - The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
      2018-07-18 12:47:04,788 WARN org.apache.flink.yarn.AbstractYarnClusterDescriptor - The configuration directory ('/opt/flink-streaming/flink-streaming-1.5.1-1.5.1-bin-hadoop27-scala_2.11-1531485329/conf') contains both LOG4J and Logback configuration files. Please delete or rename one of them.
      2018-07-18 12:47:07,846 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Submitting application master application_1531474158783_10814
      2018-07-18 12:47:08,073 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1531474158783_10814
      2018-07-18 12:47:08,074 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Waiting for the cluster to be allocated
      2018-07-18 12:47:08,076 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Deploying cluster, current state ACCEPTED
      2018-07-18 12:47:12,864 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - YARN application has been deployed successfully.
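      The "Cluster specification" entry in the log above shows that the YARN flags from the command were applied: -yjm 768 appears as masterMemoryMB=768, -ytm 2048 as taskManagerMemoryMB=2048 and -yn 1 as numberTaskManagers=1. The following is a minimal Python sketch for checking this mapping from a pasted log line; parse_cluster_spec and matches_cli_flags are hypothetical helpers, not part of Flink, and they assume the log entry keeps exactly the ClusterSpecification{key=value, ...} shape shown in this report.

```python
import re
from typing import Dict

# Matches the body of "ClusterSpecification{key=value, key=value, ...}".
_SPEC = re.compile(r"ClusterSpecification\{([^}]*)\}")


def parse_cluster_spec(line: str) -> Dict[str, int]:
    """Turn the ClusterSpecification log entry into a dict of integer fields."""
    match = _SPEC.search(line)
    if match is None:
        raise ValueError("no ClusterSpecification found in line")
    spec = {}
    for pair in match.group(1).split(","):
        # Each pair looks like "masterMemoryMB=768" (possibly with a leading space).
        key, value = pair.strip().split("=")
        spec[key] = int(value)
    return spec


def matches_cli_flags(spec: Dict[str, int], yn: int, yjm: int, ytm: int) -> bool:
    """Compare the logged spec with the -yn, -yjm and -ytm values from the command."""
    return (
        spec["numberTaskManagers"] == yn
        and spec["masterMemoryMB"] == yjm
        and spec["taskManagerMemoryMB"] == ytm
    )
```

      For the run above, parse_cluster_spec on the 12:47:04,409 entry yields all four fields, and matches_cli_flags(spec, yn=1, yjm=768, ytm=2048) holds, so the resource flags are not the problem.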
      

      Job Manager logs:

      2018-07-18 12:47:09,913 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --------------------------------------------------------------------------------
      2018-07-18 12:47:09,915 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting YarnSessionClusterEntrypoint (Version: 1.5.1, Rev:3488f8b, Date:10.07.2018 @ 11:51:27 GMT)
      ...
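      The startup banner above names the entrypoint the JobManager was started with; here it is YarnSessionClusterEntrypoint, even though the job was submitted with -m yarn-cluster and no session. A minimal Python sketch for pulling that name out of a banner line follows; parse_banner and started_session_cluster are hypothetical helpers, and the regex assumes the "Starting <component> (Version: <version>, ..." format seen in the logs of this report.

```python
import re
from typing import Optional, Tuple

# Matches JobManager banner lines such as
# "Starting YarnSessionClusterEntrypoint (Version: 1.5.1, Rev:3488f8b, ...)".
_BANNER = re.compile(r"Starting (.+?) \(Version: ([^,]+),")


def parse_banner(line: str) -> Optional[Tuple[str, str]]:
    """Return (component, version) from a JobManager startup banner, or None."""
    match = _BANNER.search(line)
    return (match.group(1), match.group(2)) if match else None


def started_session_cluster(line: str) -> bool:
    """True if the banner names the session entrypoint seen in this report."""
    parsed = parse_banner(line)
    return parsed is not None and parsed[0] == "YarnSessionClusterEntrypoint"
```

      The same parser also accepts the legacy-mode banner further down ("Starting YARN ApplicationMaster / ResourceManager / JobManager (Version: 1.5.1, ..."), for which started_session_cluster returns False.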
      

      Issues:

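      1. The Flink job runs in a Flink session cluster (the JobManager log shows YarnSessionClusterEntrypoint) instead of a dedicated per-job cluster.
      2. Pressing Ctrl+C or running 'stop' stops neither the job nor the YARN cluster.
      3. Cancelling the job via the JobManager web UI does not stop the Flink cluster. To kill the cluster, we have to run: yarn application -kill <id>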
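      Until the cluster shuts down by itself, the manual cleanup in item 3 needs the YARN application ID, which the CLI prints in its "Submitted application ..." line. The following is a minimal Python sketch for extracting that ID and building the kill command; find_application_id and kill_command are hypothetical helpers, not part of Flink or YARN, and the regex assumes the YarnClientImpl log format shown above.

```python
import re
from typing import List, Optional

# Matches the YarnClientImpl entry, e.g.
# "... YarnClientImpl - Submitted application application_1531474158783_10814".
_APP_ID = re.compile(r"Submitted application (application_\d+_\d+)")


def find_application_id(cli_log: str) -> Optional[str]:
    """Return the first YARN application ID found in the Flink CLI output, or None."""
    match = _APP_ID.search(cli_log)
    return match.group(1) if match else None


def kill_command(app_id: str) -> List[str]:
    """Build the argv for the manual cleanup step: yarn application -kill <id>."""
    return ["yarn", "application", "-kill", app_id]


if __name__ == "__main__":
    log = (
        "2018-07-18 12:47:08,073 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl"
        " - Submitted application application_1531474158783_10814\n"
    )
    app_id = find_application_id(log)
    if app_id is not None:
        # Prints: yarn application -kill application_1531474158783_10814
        print(" ".join(kill_command(app_id)))
```

      The returned list could be passed to subprocess.run(..., check=True) on a host where the yarn CLI is on the PATH; the sketch only prints it so that it has no side effects.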

      We also tried running the Flink job in 'mode: legacy' and ran into the same issues:

      1. Add the property 'mode: legacy' to ./conf/flink-conf.yaml.
      2. Run the following command:
      <flink-1.5.1>/bin/flink run -m yarn-cluster -yn 1 -yqu flink -yjm 768 -ytm 2048 -j ./flink-quickstart-java-1.0-SNAPSHOT.jar -c org.test.WordCount
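      Step 1 can be scripted so that repeated runs do not add a second 'mode' key to flink-conf.yaml. The following is a minimal Python sketch working on the file contents as a string; set_legacy_mode is a hypothetical helper, and it assumes the key is written at the start of a line, as in the default flink-conf.yaml layout (commented-out entries are left alone).

```python
import re

# Matches an existing top-level "mode:" entry (key at column 0).
_MODE_LINE = re.compile(r"^mode:.*$", re.MULTILINE)


def set_legacy_mode(conf_text: str) -> str:
    """Return flink-conf.yaml contents with 'mode: legacy' set."""
    if _MODE_LINE.search(conf_text):
        # Overwrite the first existing entry instead of adding a duplicate key.
        return _MODE_LINE.sub("mode: legacy", conf_text, count=1)
    # Otherwise append the entry, making sure it starts on its own line.
    if conf_text and not conf_text.endswith("\n"):
        conf_text += "\n"
    return conf_text + "mode: legacy\n"
```

      Reading the file, passing its text through set_legacy_mode and writing the result back gives the same file on every run after the first.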
      

      Flink CLI logs:

      Setting HADOOP_CONF_DIR=/etc/hadoop/conf because no HADOOP_CONF_DIR was set.
      SLF4J: Class path contains multiple SLF4J bindings.
      SLF4J: Found binding in [jar:file:/opt/flink-streaming/flink-streaming-1.5.1-1.5.1-bin-hadoop27-scala_2.11-1531485329/lib/slf4j-log4j12-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
      SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.10-1/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
      SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
      SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
      2018-07-18 16:07:13,820 INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://hmaster-1.ipbl.rgcloud.net:8188/ws/v1/timeline/
      2018-07-18 16:07:14,165 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.LegacyYarnClusterDescriptor to locate the jar
      2018-07-18 16:07:14,165 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.LegacyYarnClusterDescriptor to locate the jar
      2018-07-18 16:07:14,182 WARN org.apache.flink.yarn.AbstractYarnClusterDescriptor - Neither the HADOOP_CONF_DIR nor the YARN_CONF_DIR environment variable is set. The Flink YARN Client needs one of these to be set to properly load the Hadoop configuration for accessing YARN.
      2018-07-18 16:07:14,356 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Cluster specification: ClusterSpecification{masterMemoryMB=768, taskManagerMemoryMB=2048, numberTaskManagers=1, slotsPerTaskManager=1}
      2018-07-18 16:07:14,703 WARN org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory - The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
      2018-07-18 16:07:14,708 WARN org.apache.flink.yarn.AbstractYarnClusterDescriptor - The configuration directory ('/home/skrasovs/flink-conf') contains both LOG4J and Logback configuration files. Please delete or rename one of them.
      2018-07-18 16:07:17,678 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Submitting application master application_1531474158783_10843
      2018-07-18 16:07:17,717 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1531474158783_10843
      2018-07-18 16:07:17,717 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Waiting for the cluster to be allocated
      2018-07-18 16:07:17,720 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Deploying cluster, current state ACCEPTED
      2018-07-18 16:07:23,527 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - YARN application has been deployed successfully.
      Using the parallelism provided by the remote cluster (1). To use another parallelism, set it at the ./bin/flink client.
      Starting execution of program
      2018-07-18 16:07:23,551 INFO org.apache.flink.yarn.YarnClusterClient - Starting program in interactive mode (detached: false)
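      The last client entry, "Starting program in interactive mode (detached: false)", confirms that the job was submitted in attached mode, as stated at the top of this report. A minimal Python sketch for reading that flag from CLI output follows; submitted_detached is a hypothetical helper, and the regex assumes the YarnClusterClient wording shown above.

```python
import re
from typing import Optional

# Matches the YarnClusterClient entry
# "Starting program in interactive mode (detached: false)".
_DETACHED = re.compile(r"interactive mode \(detached: (true|false)\)")


def submitted_detached(cli_log: str) -> Optional[bool]:
    """Return the detached flag the client logged, or None if the entry is absent."""
    match = _DETACHED.search(cli_log)
    if match is None:
        return None
    return match.group(1) == "true"
```

      For the legacy-mode run above, this returns False, i.e. the client stayed attached to the job it started.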
      

      Job Manager logs:

      2018-07-18 16:07:19,831 INFO org.apache.flink.yarn.YarnApplicationMasterRunner - --------------------------------------------------------------------------------
      2018-07-18 16:07:19,833 INFO org.apache.flink.yarn.YarnApplicationMasterRunner - Starting YARN ApplicationMaster / ResourceManager / JobManager (Version: 1.5.1, Rev:3488f8b, Date:10.07.2018 @ 11:51:27 GMT)
      

          People

            Andrey Zagrebin (azagrebin)
            Sergey Krasovskiy (krasovcheg)
            Votes: 3
            Watchers: 7