Hadoop YARN / YARN-10464

Flink job on YARN with HA enabled crashes all RMs on attempt recovery

Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 2.6.0
    • Fix Version/s: None
    • Component/s: resourcemanager
    • Labels: None

    Description

      I am trying to run a Flink (1.11.1) job on our Hadoop cluster (2.6.0) with HA enabled, but when I test failover by killing the active RM, the entire cluster goes down.
      I have configured Flink's HA in flink-conf.yaml.
      When I kill the active RM with kill -9, YARN correctly fails over to the standby RM and the applications show as ACCEPTED for about a minute, but then the standby RM crashes with the following exception:

      2020-10-18 15:39:36.112 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler
       java.lang.NullPointerException
       at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:601)
       at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:698)
       at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1303)
       at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:123)
       at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:702)
       at java.lang.Thread.run(Thread.java:745)

      I found the following code in the Flink project, which is used when submitting high-availability jobs:

      private void activateHighAvailabilitySupport(ApplicationSubmissionContext appContext)
              throws InvocationTargetException, IllegalAccessException {

          ApplicationSubmissionContextReflector reflector = ApplicationSubmissionContextReflector.getInstance();
          reflector.setKeepContainersAcrossApplicationAttempts(appContext, true);
          reflector.setAttemptFailuresValidityInterval(
                  appContext,
                  flinkConfiguration.getLong(YarnConfigOptions.APPLICATION_ATTEMPT_FAILURE_VALIDITY_INTERVAL));
      }

      Flink HA jobs set KeepContainersAcrossApplicationAttempts to true.
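
The reflector in the snippet above presumably exists so that Flink can still run against Hadoop versions whose ApplicationSubmissionContext lacks these setters. A minimal sketch of that reflective-setter pattern follows. FakeSubmissionContext, ReflectorSketch, and trySetBoolean are hypothetical names for illustration only; this is not Flink's or YARN's actual code.

```java
import java.lang.reflect.Method;

public class ReflectorSketch {

    /** Hypothetical stand-in for YARN's ApplicationSubmissionContext. */
    public static class FakeSubmissionContext {
        private boolean keepContainers;

        public void setKeepContainersAcrossApplicationAttempts(boolean keep) {
            this.keepContainers = keep;
        }

        public boolean getKeepContainersAcrossApplicationAttempts() {
            return keepContainers;
        }
    }

    /**
     * Invokes a public boolean setter by name if the target class has it.
     * Returns true when the setter was found and called, false when it is missing.
     */
    public static boolean trySetBoolean(Object target, String methodName, boolean value) {
        try {
            Method setter = target.getClass().getMethod(methodName, boolean.class);
            setter.invoke(target, value);
            return true;
        } catch (NoSuchMethodException e) {
            // Older API version: the setter does not exist, so skip it.
            return false;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not invoke " + methodName, e);
        }
    }

    public static void main(String[] args) {
        FakeSubmissionContext ctx = new FakeSubmissionContext();
        boolean applied = trySetBoolean(ctx, "setKeepContainersAcrossApplicationAttempts", true);
        System.out.println("applied=" + applied
                + ", keepContainers=" + ctx.getKeepContainersAcrossApplicationAttempts());
    }
}
```

The point for this report is that the flag is set unconditionally for HA submissions, so every recovered attempt of such a job asks the scheduler to keep containers from the previous attempt.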

      Some properties in yarn-site.xml: 

      <property>
        <name>yarn.resourcemanager.recovery.enabled</name>
        <value>true</value>
      </property>

      <property>
        <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
        <value>false</value>
      </property>
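
One possible reading of the trace, stated as an assumption rather than a confirmed diagnosis: with work-preserving recovery disabled, the newly active RM may have no scheduler-side state for the earlier attempt, while the keep-containers flag still asks the scheduler to transfer state from it, so the transfer dereferences null. The sketch below models only that interaction. TransferStateSketch, SchedApp, and SchedAttempt are hypothetical stand-ins, not the Hadoop source.

```java
import java.util.HashMap;
import java.util.Map;

public class TransferStateSketch {

    /** Hypothetical stand-in for a scheduler-side application attempt. */
    public static class SchedAttempt {
        public Map<String, String> liveContainers = new HashMap<>();

        public void transferStateFromPreviousAttempt(SchedAttempt previous) {
            // Throws NullPointerException when 'previous' is null.
            this.liveContainers = previous.liveContainers;
        }
    }

    /** Hypothetical stand-in for a scheduler-side application. */
    public static class SchedApp {
        // Assumed to be null on the new RM when no scheduler state was recovered.
        public SchedAttempt currentAttempt;
    }

    /** Unguarded variant: mirrors the shape of the failure in the trace. */
    public static SchedAttempt addAttemptUnguarded(SchedApp app, boolean keepContainers) {
        SchedAttempt attempt = new SchedAttempt();
        if (keepContainers) {
            attempt.transferStateFromPreviousAttempt(app.currentAttempt);
        }
        app.currentAttempt = attempt;
        return attempt;
    }

    /** Guarded variant: skips the transfer when there is no previous attempt. */
    public static SchedAttempt addAttemptGuarded(SchedApp app, boolean keepContainers) {
        SchedAttempt attempt = new SchedAttempt();
        if (keepContainers && app.currentAttempt != null) {
            attempt.transferStateFromPreviousAttempt(app.currentAttempt);
        }
        app.currentAttempt = attempt;
        return attempt;
    }

    public static void main(String[] args) {
        try {
            addAttemptUnguarded(new SchedApp(), true);
        } catch (NullPointerException e) {
            System.out.println("unguarded: NullPointerException");
        }
        addAttemptGuarded(new SchedApp(), true);
        System.out.println("guarded: attempt added");
    }
}
```

A null check of this kind is one conceivable mitigation; the report does not say whether the attached patch takes this approach.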

      Attachments

        1. YARN-10464.1.patch
          1 kB
          tim yu

          People

            Assignee: Unassigned
            Reporter: tim yu (yulei0824)