Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Won't Fix
- Affects Version/s: 2.6.0, 2.7.1
- Fix Version/s: None
- Component/s: None
- Environment: RM HA with ATS
Description
1. Start the RM with HA and ATS configured, and run some YARN applications.
2. Once the applications have finished successfully, start the timeline server.
3. Now fail over HA from active to standby, or restart the node.
ATS events for applications already present in ATS are re-sent, which is not required.
Attachments
- YARN-3127.20150213-1.patch (8 kB, Naganarasimha G R)
- YARN-3127.20150329-1.patch (11 kB, Naganarasimha G R)
- AppTransition.png (182 kB, Naganarasimha G R)
- YARN-3127.20150624-1.patch (15 kB, Naganarasimha G R)
- YARN-3127.20151123-1.patch (15 kB, Naganarasimha G R)
Issue Links
Activity
This case is the same as the Timeline service starting long (days) after the application has finished. I think it is better not to store these events during recovery and instead simply report an error on the UI saying that the Timeline service doesn't know about this application.
Thanks a lot Vinod and Naga for looking into the issue.
Vinod, do you suggest that the part below should also be gracefully handled (attempt details not available), i.e. not publish the event to ATS during recovery?
org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore
  public ApplicationAttemptReport getApplicationAttempt(
      ApplicationAttemptId appAttemptId) throws YarnException, IOException {
    ApplicationReportExt app = getApplication(
        appAttemptId.getApplicationId(), ApplicationReportField.USER_AND_ACLS);
    checkAccess(app);
    TimelineEntity entity = timelineDataManager.getEntity(
        AppAttemptMetricsConstants.ENTITY_TYPE,
        appAttemptId.toString(), EnumSet.allOf(Field.class),
        UserGroupInformation.getLoginUser());
    if (entity == null) {
      throw new ApplicationAttemptNotFoundException(
          "The entity for application attempt " + appAttemptId +
          " doesn't exist in the timeline store");
    } else {
      return convertToApplicationAttemptReport(entity);
    }
  }
Please correct me if I am wrong.
Attaching an initial patch to avoid events being sent to the system metrics publisher during RM application recovery from the state store.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12698538/YARN-3127.20150213-1.patch
against trunk revision 6f5290b.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 2 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
-1 findbugs. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6624//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6624//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6624//console
This message is automatically generated.
None of the Findbugs issue reports are related to the code changes present in the patch.
Hi Zhijie Shen, Vinod Kumar Vavilapalli & Xuan Gong,
Can any one of you review this JIRA, please?
Naganarasimha G R Thank you for taking this issue! The policy of the fix looks good to me. Could you add a test case to TestRMRestart to cover this case?
Also, can we preserve following test cases?
- verify(writer).applicationStarted(any(RMApp.class));
- verify(publisher).appCreated(any(RMApp.class), anyLong());
Thanks Tsuyoshi Ozawa for the review, and sorry for the delay in responding; I was held up with other issues.
Could you add a test case to TestRMRestart to cover the case?
This has been taken care of in the updated patch.
can we preserve following test cases?
As there are changes in the transitions, TestRMAppTransitions was failing for multiple test cases if those methods were kept. The approach adopted to fix this issue is: earlier, SystemMetricsPublisher.appCreated() was invoked during the creation of RMAppImpl itself, and SystemMetricsPublisher.appACLsUpdated() was invoked in RMAppManager.createAndPopulateNewRMApp, which is common to both the recovery and new-application flows. I have removed the calls from those places and moved them into RMAppManager.publishSystemMetrics, thus ensuring that these updates are sent to SystemMetricsPublisher only during the new-application flow.
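To illustrate the idea, here is a minimal, self-contained Java sketch of gating the metrics publish on the recovery flag. All class and method names (MetricsPublisher, AppManager, createAndPopulateApp) are illustrative stand-ins, not the real YARN classes; only the shape of the fix mirrors the patch described above.

```java
import java.util.ArrayList;
import java.util.List;

public class RecoveryAwarePublisher {
    // Stand-in for SystemMetricsPublisher: records published event names.
    static class MetricsPublisher {
        final List<String> events = new ArrayList<>();
        void appCreated(String appId) { events.add("CREATED:" + appId); }
        void appACLsUpdated(String appId) { events.add("ACLS:" + appId); }
    }

    static class AppManager {
        private final MetricsPublisher publisher;
        AppManager(MetricsPublisher p) { this.publisher = p; }

        // Common entry point for both new submissions and recovery.
        void createAndPopulateApp(String appId, boolean isRecovery) {
            // ... build and register the app object here ...
            if (!isRecovery) {
                // Only the new-application flow reaches the publisher.
                publishSystemMetrics(appId);
            }
        }

        private void publishSystemMetrics(String appId) {
            publisher.appCreated(appId);
            publisher.appACLsUpdated(appId);
        }
    }

    public static void main(String[] args) {
        MetricsPublisher p = new MetricsPublisher();
        AppManager mgr = new AppManager(p);
        mgr.createAndPopulateApp("app_1", false); // new submission: publishes
        mgr.createAndPopulateApp("app_2", true);  // recovery: silent
        System.out.println(p.events); // [CREATED:app_1, ACLS:app_1]
    }
}
```

The key point is that both flows share one creation path, and the publish call sits behind a single recovery check instead of being scattered across RMAppImpl construction and RMAppManager.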
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12708037/YARN-3127.20150329-1.patch
against trunk revision 3d9132d.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
org.apache.hadoop.yarn.server.resourcemanager.TestRMHA
org.apache.hadoop.yarn.server.resourcemanager.TestRMAdminService
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebappAuthentication
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler
org.apache.hadoop.yarn.server.resourcemanager.TestMoveApplication
org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7141//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7141//console
This message is automatically generated.
Hi Tsuyoshi Ozawa,
It seems the test cases are mostly failing due to the bug introduced in HADOOP-10670, which is taken care of in HADOOP-11754. Some cases are failing due to a bind exception; I am not sure these are related to the changes in this patch and guess they might also be impacts of HADOOP-10670 (earlier, almost the same patch passed all test cases, and the new test case does not start a new RM as such, so it is unlikely to be related to these changes). In general, you can review the patch, and I will trigger Jenkins once HADOOP-11754 is in.
HADOOP-11754 has been fixed, hence uploading the same patch to re-check for test case failures.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12708571/YARN-3127.20150329-1.patch
against trunk revision 2daa478.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7183//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7183//console
This message is automatically generated.
The TestFairScheduler failure is not related to this issue and is getting fixed as part of YARN-2666. Tsuyoshi Ozawa, can you please take a look at this now?
Hi Xuan Gong, if you have the bandwidth, can you take a look at this patch too?
Naganarasimha G R Thanks for working on this. I will take a look shortly.
Naganarasimha G R Sorry for the late reply.
So, the solution here is to avoid events sent to System metrics publisher during RM application recovery from state store. It looks fine to solve the current issue.
But here is the case I am thinking right now might not work:
- we start RM, ATS correctly
- the RM failover/restart happens between the transition from FINAL_SAVING to FINISHED
- based on the original code, when we do the recovery for the applications, we will send out appFinished event to System metrics publisher to update the app status in ATS
- but based on the patch, we will not do it. In this case, the ATS will never get the app status update (changing the app status from started to finished)? This looks like an issue introduced by the patch.
Did I miss anything ?
Thanks for the review, Xuan Gong. Good catch... Well, I am not sure why the state is saved to the state store in FINAL_SAVING. Can we move it to FinalTransition? That is, we could publish the event to the publisher first and then store the state in the RMStateStore, or vice versa?
Also, IMO, depending on where the RM fails over (killed/stopped), there is a chance that entities are published to ATS after failover. So would it be good to handle this on the ATS side, such that the URL does not crash?
Hi Naganarasimha G R, any updates on this JIRA? As pointed out by Xuan Gong, the current solution seems to have some problems. Therefore, I'm canceling this patch for now. Thanks!
Thanks for reviewing Li Lu,
Issue mentioned over here main cause is already addressed in another jira by Xuan Gong and but when we test in this way we still get to see null in the webui and also more importantly this jira addressing is required as events are published for every app (start and finished) on RM failover. So if 10000 apps are maintained then so many additional non required events are getting triggered. this we need to address. And for the issue pointed by Xuan Gong, i had asked for suggestion of approach being taken and hence waiting for it, AFAIK we need to ensure first ATS events are sent and then store the final application state to RMstate store in FINAL_SAVING transition (and also other possible cases where app is created and will be killed b4 attempt is created in which case FINAL_SAVING is not called). If this approach is fine then will update the patch and the description.
Thanks Xuan Gong for pointing out this JIRA.
I was actually planning to change the description and scope of this JIRA, as what I am trying to solve here is to stop unwanted timeline events that get triggered on RM failover. If we observe, most timeline events are currently generated even for finished apps, which is not needed, as the RM by default stores around 10000 apps. Also, when I recently tested the scenario mentioned above, it seems it has already been corrected (before YARN-3701), but some nulls are displayed in the web UI. I am a little held up; I will try to modify the JIRA and rework the patch based on your comments (handling the sending of events on finish) ASAP.
Hi Xuan Gong,
I have modified the patch to work for the scenario you mentioned, but it avoids duplicate publishing only on a best-effort basis: events are published before saving to the state store (a failover that happens after publishing but before saving to the state store might result in multiple events being published).
Based on the state transition diagram, all the events go through the FINAL_SAVING state except for:
- NEW -> FINISHED (on RECOVER event)
- NEW -> FAILED (on RECOVER event)
- NEW -> KILLED (on KILL/RECOVER event)
- KILLING -> FINISHED (on ATTEMPT_FINISHED event)
- RUNNING -> FINISHED (on ATTEMPT_FINISHED event)
The first two need no handling, as the state would have been published to ATS before recovery. For the third, when the application is killed from the NEW state, we need to publish explicitly. The last two transitions, which do not go through FINAL_SAVING, also need to be handled.
Please review...
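As a rough illustration of the publish-before-save ordering described above, here is a self-contained sketch. All names here are hypothetical stand-ins, not the real RMAppImpl code; the point is only the ordering and its failure mode.

```java
import java.util.ArrayList;
import java.util.List;

public class PublishBeforeSave {
    // Records the order of side effects for inspection.
    static final List<String> log = new ArrayList<>();

    static void publishAppFinished(String appId) { log.add("ATS:" + appId); }
    static void saveFinalState(String appId)     { log.add("STORE:" + appId); }

    // FINAL_SAVING-style transition: publish to ATS first, persist second.
    // If the RM fails over between the two steps, recovery replays the
    // transition and the ATS event is duplicated, but never lost.
    static void finalSavingTransition(String appId, boolean crashBeforeSave) {
        publishAppFinished(appId);
        if (crashBeforeSave) {
            return; // simulated failover: state never reaches the store
        }
        saveFinalState(appId);
    }

    public static void main(String[] args) {
        finalSavingTransition("app_1", true);  // failover after publish
        finalSavingTransition("app_1", false); // replay on recovery
        System.out.println(log); // [ATS:app_1, ATS:app_1, STORE:app_1]
    }
}
```

This is why the patch can only be best-effort: the event is duplicated in the failover window, which is the trade-off accepted in exchange for never losing an event.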
-1 overall |
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | pre-patch | 16m 11s | Pre-patch trunk compilation is healthy. |
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | tests included | 0m 0s | The patch appears to include 3 new or modified test files. |
+1 | javac | 7m 47s | There were no new javac warning messages. |
+1 | javadoc | 9m 39s | There were no new javadoc warning messages. |
+1 | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
-1 | checkstyle | 0m 49s | The applied patch generated 1 new checkstyle issues (total was 150, now 151). |
-1 | whitespace | 0m 1s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. |
+1 | install | 1m 20s | mvn install still works. |
+1 | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. |
+1 | findbugs | 1m 26s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
+1 | yarn tests | 51m 55s | Tests passed in hadoop-yarn-server-resourcemanager. |
90m 6s |
Subsystem | Report/Notes |
---|---|
Patch URL | http://issues.apache.org/jira/secure/attachment/12747076/YARN-3127.20150624-1.patch |
Optional Tests | javadoc javac unit findbugs checkstyle |
git revision | trunk / f8f6091 |
checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8653/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt |
whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/8653/artifact/patchprocess/whitespace.txt |
hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8653/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8653/testReport/ |
Java | 1.7.0_55 |
uname | Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8653/console |
This message was automatically generated.
Hi Xuan Gong, Li Lu (inactive) & Tsuyoshi Ozawa,
Can any one of you have a look at this JIRA?
Hi Xuan Gong/ Tsuyoshi Ozawa,
I feel the issue is valid and needs to be fixed. If one of you can take a look at the approach and the patch I mentioned earlier, it would help get this JIRA moving.
Hi Sangjin Lee, Rohith Sharma K S & Xuan Gong, I have rebased the patch; can you please take a look at it? Based on this, we can get YARN-4350 corrected.
-1 overall |
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 0s | Docker mode activated. |
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 0s | The patch appears to include 3 new or modified test files. |
+1 | mvninstall | 7m 48s | trunk passed |
+1 | compile | 0m 29s | trunk passed with JDK v1.8.0_66 |
+1 | compile | 0m 32s | trunk passed with JDK v1.7.0_85 |
+1 | checkstyle | 0m 14s | trunk passed |
+1 | mvnsite | 0m 38s | trunk passed |
+1 | mvneclipse | 0m 15s | trunk passed |
+1 | findbugs | 1m 15s | trunk passed |
+1 | javadoc | 0m 22s | trunk passed with JDK v1.8.0_66 |
+1 | javadoc | 0m 29s | trunk passed with JDK v1.7.0_85 |
+1 | mvninstall | 0m 35s | the patch passed |
+1 | compile | 0m 29s | the patch passed with JDK v1.8.0_66 |
+1 | javac | 0m 29s | the patch passed |
+1 | compile | 0m 32s | the patch passed with JDK v1.7.0_85 |
-1 | javac | 3m 43s | hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_85 with JDK v1.7.0_85 generated 1 new issues (was 2, now 2). |
+1 | javac | 0m 32s | the patch passed |
-1 | checkstyle | 0m 14s | Patch generated 1 new checkstyle issues in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager (total was 147, now 148). |
+1 | mvnsite | 0m 38s | the patch passed |
+1 | mvneclipse | 0m 16s | the patch passed |
+1 | whitespace | 0m 0s | Patch has no whitespace issues. |
+1 | findbugs | 1m 23s | the patch passed |
+1 | javadoc | 0m 23s | the patch passed with JDK v1.8.0_66 |
+1 | javadoc | 0m 28s | the patch passed with JDK v1.7.0_85 |
-1 | unit | 59m 15s | hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. |
-1 | unit | 60m 26s | hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_85. |
+1 | asflicense | 0m 23s | Patch does not generate ASF License warnings. |
138m 10s |
Reason | Tests |
---|---|
JDK v1.8.0_66 Failed junit tests | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
hadoop.yarn.server.resourcemanager.TestClientRMTokens | |
JDK v1.7.0_85 Failed junit tests | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
This message was automatically generated.
It seems the test failures are unrelated to the fix, and the checkstyle issue is not valid.
Closing this JIRA based on the discussion in YARN-4392. The conclusion is that resending the events during recovery is OK, as there is a probability that ATS events have not yet been dispatched when the RM fails over. But we need to ensure that data such as the event time is not altered when resending events.
In org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.AttemptRecoveredTransition
RMAppImpl.isAppInFinalState returns true, hence the transition which publishes the attempt to ATS during recovery is not played.
So one option is to move BaseFinalTransition.transition outside this if-block.
But the other query I have is: during recovery, is it required to publish events to ATS at all?
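A small, self-contained sketch of the option described above (all names are illustrative, not the real RMAppAttemptImpl code): the publish is hoisted out of the final-state check so that recovery replays the event, which YARN-4392 concluded is acceptable.

```java
import java.util.ArrayList;
import java.util.List;

public class AttemptRecoverySketch {
    // Records attempt IDs whose finished event reached the publisher.
    static final List<String> published = new ArrayList<>();

    static void baseFinalTransition(String attemptId) {
        // In the real code, this is where the attempt-finished event
        // reaches the system metrics publisher / ATS.
        published.add(attemptId);
    }

    // Original shape: the publish happens only when the app is NOT already
    // in a final state, so recovered finished attempts are never replayed.
    static void recoveredTransitionOriginal(String attemptId,
                                            boolean appInFinalState) {
        if (!appInFinalState) {
            baseFinalTransition(attemptId);
        }
    }

    // Proposed shape: hoist the publish out of the if-block so recovery
    // always replays the event (duplicates are acceptable per YARN-4392).
    static void recoveredTransitionProposed(String attemptId,
                                            boolean appInFinalState) {
        baseFinalTransition(attemptId);
    }

    public static void main(String[] args) {
        recoveredTransitionOriginal("attempt_1", true); // nothing published
        recoveredTransitionProposed("attempt_2", true); // published
        System.out.println(published); // [attempt_2]
    }
}
```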