Details
Type: Bug
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: 2.6.0
Description
I am trying to run a Flink (1.11.1) job on our Hadoop cluster (2.6.0) with HA enabled, but when I test failover by killing the active RM, it brings down the entire cluster.
I have configured Flink's HA in flink-conf.yaml.
When I kill the active RM using kill -9, YARN correctly switches to the standby RM and I can see applications in the ACCEPTED state for about a minute, but soon afterwards the standby RM crashes with the following exception:
2020-10-18 15:39:36.112 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:601)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:698)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1303)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:123)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:702)
    at java.lang.Thread.run(Thread.java:745)
I found the following code for submitting high-availability jobs in the Flink project:
private void activateHighAvailabilitySupport(ApplicationSubmissionContext appContext)
        throws InvocationTargetException, IllegalAccessException {
    ApplicationSubmissionContextReflector reflector =
            ApplicationSubmissionContextReflector.getInstance();

    reflector.setKeepContainersAcrossApplicationAttempts(appContext, true);

    reflector.setAttemptFailuresValidityInterval(
            appContext,
            flinkConfiguration.getLong(
                    YarnConfigOptions.APPLICATION_ATTEMPT_FAILURE_VALIDITY_INTERVAL));
}
Flink HA jobs set KeepContainersAcrossApplicationAttempts to true.
Some properties in yarn-site.xml:
<property>
<name>yarn.resourcemanager.recovery.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
<value>false</value>
</property>
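To make the combination described above easier to check on other clusters, here is a minimal sketch in Java. It only tests whether a configuration matches the setup in this report: RM recovery enabled, work-preserving recovery disabled, and an application that keeps containers across attempts. It does not claim that this combination is the root cause of the NullPointerException. The class YarnRecoveryCheck and its methods are hypothetical helpers, not part of Flink or Hadoop, and treating a missing property as false is an assumption of the sketch.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of Flink or Hadoop.
class YarnRecoveryCheck {

    static final String RECOVERY =
            "yarn.resourcemanager.recovery.enabled";
    static final String WORK_PRESERVING =
            "yarn.resourcemanager.work-preserving-recovery.enabled";

    // Reads <property><name>/<value> pairs from a yarn-site.xml style document.
    static Map<String, String> parseProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new HashMap<>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element property = (Element) nodes.item(i);
                String name = childText(property, "name");
                String value = childText(property, "value");
                if (name != null && value != null) {
                    props.put(name, value);
                }
            }
            return props;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration XML", e);
        }
    }

    // Returns the trimmed text of the first child element with the given tag, or null.
    private static String childText(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? null : list.item(0).getTextContent().trim();
    }

    // True when the settings match the setup described in this report.
    // Assumption: a property that is absent is treated as "false".
    static boolean matchesReportedSetup(Map<String, String> props, boolean keepContainers) {
        boolean recovery = Boolean.parseBoolean(props.getOrDefault(RECOVERY, "false"));
        boolean workPreserving =
                Boolean.parseBoolean(props.getOrDefault(WORK_PRESERVING, "false"));
        return keepContainers && recovery && !workPreserving;
    }

    public static void main(String[] args) {
        // The two properties from this report, wrapped in a root element.
        String xml = "<configuration>"
                + "<property><name>" + RECOVERY + "</name><value>true</value></property>"
                + "<property><name>" + WORK_PRESERVING + "</name><value>false</value></property>"
                + "</configuration>";
        // keepContainers = true mirrors what activateHighAvailabilitySupport sets.
        System.out.println(matchesReportedSetup(parseProperties(xml), true));
    }
}
```

The keepContainers flag is passed in by the caller because it is set per application at submission time, not in yarn-site.xml.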