Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Won't Fix
- Affects Version/s: 2.6.0, 2.7.1
- Fix Version/s: None
- Component/s: None
- Environment: RM HA with ATS
Description
1. Start the RM with HA and ATS configured, and run some YARN applications.
2. Once the applications have finished successfully, start the timeline server.
3. Now fail over HA from active to standby, or restart the node.
ATS events for applications already present in ATS are re-sent, which is not required.
Attachments
- YARN-3127.20150213-1.patch (8 kB, Naganarasimha G R)
- YARN-3127.20150329-1.patch (11 kB, Naganarasimha G R)
- AppTransition.png (182 kB, Naganarasimha G R)
- YARN-3127.20150624-1.patch (15 kB, Naganarasimha G R)
- YARN-3127.20151123-1.patch (15 kB, Naganarasimha G R)
Issue Links
Activity
This case is the same as the Timeline service starting long (days) after the application has finished. I think it is better not to store these events during recovery and instead simply report an error on the UI saying that the Timeline service doesn't know about this application.
Thanks a lot Vinod and Naga for looking into the issue.
Vinod, do you suggest that the part below should also be gracefully handled (attempt details not available), i.e. not publish the event to ATS during recovery?
org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore
  public ApplicationAttemptReport getApplicationAttempt(
      ApplicationAttemptId appAttemptId) throws YarnException, IOException {
    ApplicationReportExt app = getApplication(
        appAttemptId.getApplicationId(), ApplicationReportField.USER_AND_ACLS);
    checkAccess(app);
    TimelineEntity entity = timelineDataManager.getEntity(
        AppAttemptMetricsConstants.ENTITY_TYPE,
        appAttemptId.toString(), EnumSet.allOf(Field.class),
        UserGroupInformation.getLoginUser());
    if (entity == null) {
      throw new ApplicationAttemptNotFoundException(
          "The entity for application attempt " + appAttemptId +
          " doesn't exist in the timeline store");
    } else {
      return convertToApplicationAttemptReport(entity);
    }
  }
Please correct me if I am wrong.
Attaching an initial patch to avoid events being sent to the system metrics publisher during RM application recovery from the state store.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12698538/YARN-3127.20150213-1.patch
against trunk revision 6f5290b.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 2 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
-1 findbugs. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6624//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6624//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6624//console
This message is automatically generated.
None of the Findbugs issue reports are related to the code changes present in the patch.
Hi Zhijie Shen, Vinod Kumar Vavilapalli & Xuan Gong,
Can any one of you review this JIRA, please?
Naganarasimha G R Thank you for taking this issue! The policy of the fix looks good to me. Could you add a test case to TestRMRestart to cover this case?
Also, can we preserve following test cases?
- verify(writer).applicationStarted(any(RMApp.class));
- verify(publisher).appCreated(any(RMApp.class), anyLong());
Thanks Tsuyoshi Ozawa for the review, and sorry for the delay in responding; I was held up with other issues.
Could you add a test case to TestRMRestart to cover the case?
This has been taken care of in the updated patch.
can we preserve following test cases?
As there are changes in the transitions, TestRMAppTransitions was failing for multiple test cases if those methods were kept. The approach adopted to fix this issue is: earlier, SystemMetricsPublisher.appCreated() was invoked during the creation of RMAppImpl itself, and SystemMetricsPublisher.appACLsUpdated() was invoked in RMAppManager.createAndPopulateNewRMApp, which is common to both the recovery and new-application flows. I have removed the calls from those places and moved them into RMAppManager.publishSystemMetrics, thus ensuring that these updates are sent to SystemMetricsPublisher only during the new-application flow.
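To illustrate the idea, here is a minimal, self-contained Java sketch of gating the metrics publish on the recovery flag. All class and method names (MetricsPublisher, AppManager, createAndPopulateApp) are illustrative stand-ins, not the real YARN classes; only the shape of the fix mirrors the patch described above.

```java
import java.util.ArrayList;
import java.util.List;

public class RecoveryAwarePublisher {
    // Stand-in for SystemMetricsPublisher: records published event names.
    static class MetricsPublisher {
        final List<String> events = new ArrayList<>();
        void appCreated(String appId) { events.add("CREATED:" + appId); }
        void appACLsUpdated(String appId) { events.add("ACLS:" + appId); }
    }

    static class AppManager {
        private final MetricsPublisher publisher;
        AppManager(MetricsPublisher p) { this.publisher = p; }

        // Common entry point for both new submissions and recovery.
        void createAndPopulateApp(String appId, boolean isRecovery) {
            // ... build and register the app object here ...
            if (!isRecovery) {
                // Only the new-application flow reaches the publisher.
                publishSystemMetrics(appId);
            }
        }

        private void publishSystemMetrics(String appId) {
            publisher.appCreated(appId);
            publisher.appACLsUpdated(appId);
        }
    }

    public static void main(String[] args) {
        MetricsPublisher p = new MetricsPublisher();
        AppManager mgr = new AppManager(p);
        mgr.createAndPopulateApp("app_1", false); // new submission: publishes
        mgr.createAndPopulateApp("app_2", true);  // recovery: silent
        System.out.println(p.events); // [CREATED:app_1, ACLS:app_1]
    }
}
```

The key point is that both flows share one creation path, and the publish call sits behind a single recovery check instead of being scattered across RMAppImpl construction and RMAppManager.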
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12708037/YARN-3127.20150329-1.patch
against trunk revision 3d9132d.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
org.apache.hadoop.yarn.server.resourcemanager.TestRMHA
org.apache.hadoop.yarn.server.resourcemanager.TestRMAdminService
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebappAuthentication
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler
org.apache.hadoop.yarn.server.resourcemanager.TestMoveApplication
org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7141//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7141//console
This message is automatically generated.
Hi Tsuyoshi Ozawa,
It seems the test cases are mostly failing due to the bug introduced in HADOOP-10670, which is taken care of in HADOOP-11754. Some cases are failing due to a bind exception; I am not sure these are related to the changes in this patch and guess they might also be impacts of HADOOP-10670 (earlier, almost the same patch passed all test cases, and the new test case does not start a new RM as such, so it is unlikely to be related to these changes). In general, you can review the patch, and I will trigger Jenkins once HADOOP-11754 is in.
HADOOP-11754 has been fixed, hence uploading the same patch to re-check for test case failures.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12708571/YARN-3127.20150329-1.patch
against trunk revision 2daa478.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7183//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7183//console
This message is automatically generated.
The TestFairScheduler failure is not related to this issue and is getting fixed as part of YARN-2666. Tsuyoshi Ozawa, can you please take a look at this now?
Hi Xuan Gong, if you have the bandwidth, can you take a look at this patch too?
Naganarasimha G R Thanks for working on this. I will take a look shortly.
Naganarasimha G R Sorry for the late reply.
So, the solution here is to avoid events sent to System metrics publisher during RM application recovery from state store. It looks fine to solve the current issue.
But here is the case I am thinking right now might not work:
- we start RM, ATS correctly
- the RM failover/restart happens between the transition from FINAL_SAVING to FINISHED
- based on the original code, when we do the recovery for the applications, we will send out appFinished event to System metrics publisher to update the app status in ATS
- but based on the patch, we will not do it. In this case, the ATS will never get the app status update (changing the app status from started to finished)? This looks like an issue introduced by the patch.
Did I miss anything ?
Thanks for the review, Xuan Gong. Good catch... Well, I am not sure why the state is saved to the state store in FINAL_SAVING. Can we move it to FinalTransition? That is, we could publish the event to the publisher first and then store the state in the RMStateStore, or vice versa?
Also, IMO, depending on where the RM fails over (killed/stopped), there is a chance that entities are published to ATS after failover. So would it be good to handle this on the ATS side, such that the URL does not crash?
Hi Naganarasimha G R, any updates on this JIRA? As pointed out by Xuan Gong, the current solution seems to have some problems. Therefore, I'm canceling this patch for now. Thanks!
Thanks for reviewing Li Lu,
Issue mentioned over here main cause is already addressed in another jira by Xuan Gong and but when we test in this way we still get to see null in the webui and also more importantly this jira addressing is required as events are published for every app (start and finished) on RM failover. So if 10000 apps are maintained then so many additional non required events are getting triggered. this we need to address. And for the issue pointed by Xuan Gong, i had asked for suggestion of approach being taken and hence waiting for it, AFAIK we need to ensure first ATS events are sent and then store the final application state to RMstate store in FINAL_SAVING transition (and also other possible cases where app is created and will be killed b4 attempt is created in which case FINAL_SAVING is not called). If this approach is fine then will update the patch and the description.
Thanks Xuan Gong for pointing out this JIRA.
I was actually planning to change the description and scope of this JIRA, as what I am trying to solve here is to stop unwanted timeline events that get triggered on RM failover. If we observe, most timeline events are currently generated even for finished apps, which is not needed, as the RM by default stores around 10000 apps. Also, when I recently tested the scenario mentioned above, it seems it has already been corrected (before YARN-3701), but some nulls are displayed in the web UI. I am a little held up; I will try to modify the JIRA and rework the patch based on your comments (handling the sending of events on finish) ASAP.
Hi Xuan Gong,
I have modified the patch to work for the scenario you mentioned, but it avoids duplicate publishing only on a best-effort basis: events are published before saving to the state store (a failover that happens after publishing but before saving to the state store might result in multiple events being published).
Based on the state transition diagram, all the events go through the FINAL_SAVING state except for:
- NEW -> FINISHED (on RECOVER event)
- NEW -> FAILED (on RECOVER event)
- NEW -> KILLED (on KILL/RECOVER event)
- KILLING -> FINISHED (on ATTEMPT_FINISHED event)
- RUNNING -> FINISHED (on ATTEMPT_FINISHED event)
The first two need no handling, as the state would have been published to ATS before recovery. For the third, when the application is killed from the NEW state, we need to publish explicitly. The last two transitions, which do not go through FINAL_SAVING, also need to be handled.
Please review...
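As a rough illustration of the publish-before-save ordering described above, here is a self-contained sketch. All names here are hypothetical stand-ins, not the real RMAppImpl code; the point is only the ordering and its failure mode.

```java
import java.util.ArrayList;
import java.util.List;

public class PublishBeforeSave {
    // Records the order of side effects for inspection.
    static final List<String> log = new ArrayList<>();

    static void publishAppFinished(String appId) { log.add("ATS:" + appId); }
    static void saveFinalState(String appId)     { log.add("STORE:" + appId); }

    // FINAL_SAVING-style transition: publish to ATS first, persist second.
    // If the RM fails over between the two steps, recovery replays the
    // transition and the ATS event is duplicated, but never lost.
    static void finalSavingTransition(String appId, boolean crashBeforeSave) {
        publishAppFinished(appId);
        if (crashBeforeSave) {
            return; // simulated failover: state never reaches the store
        }
        saveFinalState(appId);
    }

    public static void main(String[] args) {
        finalSavingTransition("app_1", true);  // failover after publish
        finalSavingTransition("app_1", false); // replay on recovery
        System.out.println(log); // [ATS:app_1, ATS:app_1, STORE:app_1]
    }
}
```

This is why the patch can only be best-effort: the event is duplicated in the failover window, which is the trade-off accepted in exchange for never losing an event.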
-1 overall |
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | pre-patch | 16m 11s | Pre-patch trunk compilation is healthy. |
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | tests included | 0m 0s | The patch appears to include 3 new or modified test files. |
+1 | javac | 7m 47s | There were no new javac warning messages. |
+1 | javadoc | 9m 39s | There were no new javadoc warning messages. |
+1 | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
-1 | checkstyle | 0m 49s | The applied patch generated 1 new checkstyle issues (total was 150, now 151). |
-1 | whitespace | 0m 1s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. |
+1 | install | 1m 20s | mvn install still works. |
+1 | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. |
+1 | findbugs | 1m 26s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
+1 | yarn tests | 51m 55s | Tests passed in hadoop-yarn-server-resourcemanager. |
90m 6s |
Subsystem | Report/Notes |
---|---|
Patch URL | http://issues.apache.org/jira/secure/attachment/12747076/YARN-3127.20150624-1.patch |
Optional Tests | javadoc javac unit findbugs checkstyle |
git revision | trunk / f8f6091 |
checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8653/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt |
whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/8653/artifact/patchprocess/whitespace.txt |
hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8653/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8653/testReport/ |
Java | 1.7.0_55 |
uname | Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8653/console |
This message was automatically generated.
Hi Xuan Gong, Li Lu (inactive) & Tsuyoshi Ozawa,
Can any one of you have a look at this JIRA?
Hi Xuan Gong/ Tsuyoshi Ozawa,
I feel the issue is valid and needs to be fixed. If one of you can take a look at the approach and the patch I mentioned earlier, it would help get this JIRA moving.
Hi Sangjin Lee, Rohith Sharma K S & Xuan Gong, I have rebased the patch; can you please take a look at it? Based on this, we can get YARN-4350 corrected.
-1 overall |
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 0s | Docker mode activated. |
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 0s | The patch appears to include 3 new or modified test files. |
+1 | mvninstall | 7m 48s | trunk passed |
+1 | compile | 0m 29s | trunk passed with JDK v1.8.0_66 |
+1 | compile | 0m 32s | trunk passed with JDK v1.7.0_85 |
+1 | checkstyle | 0m 14s | trunk passed |
+1 | mvnsite | 0m 38s | trunk passed |
+1 | mvneclipse | 0m 15s | trunk passed |
+1 | findbugs | 1m 15s | trunk passed |
+1 | javadoc | 0m 22s | trunk passed with JDK v1.8.0_66 |
+1 | javadoc | 0m 29s | trunk passed with JDK v1.7.0_85 |
+1 | mvninstall | 0m 35s | the patch passed |
+1 | compile | 0m 29s | the patch passed with JDK v1.8.0_66 |
+1 | javac | 0m 29s | the patch passed |
+1 | compile | 0m 32s | the patch passed with JDK v1.7.0_85 |
-1 | javac | 3m 43s | hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_85 with JDK v1.7.0_85 generated 1 new issues (was 2, now 2). |
+1 | javac | 0m 32s | the patch passed |
-1 | checkstyle | 0m 14s | Patch generated 1 new checkstyle issues in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager (total was 147, now 148). |
+1 | mvnsite | 0m 38s | the patch passed |
+1 | mvneclipse | 0m 16s | the patch passed |
+1 | whitespace | 0m 0s | Patch has no whitespace issues. |
+1 | findbugs | 1m 23s | the patch passed |
+1 | javadoc | 0m 23s | the patch passed with JDK v1.8.0_66 |
+1 | javadoc | 0m 28s | the patch passed with JDK v1.7.0_85 |
-1 | unit | 59m 15s | hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. |
-1 | unit | 60m 26s | hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_85. |
+1 | asflicense | 0m 23s | Patch does not generate ASF License warnings. |
138m 10s |
Reason | Tests |
---|---|
JDK v1.8.0_66 Failed junit tests | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
hadoop.yarn.server.resourcemanager.TestClientRMTokens | |
JDK v1.7.0_85 Failed junit tests | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
This message was automatically generated.
It seems the test failures are unrelated to the fix, and the checkstyle issue is not valid.
Closing this JIRA based on the discussion in YARN-4392. The conclusion is that resending the events during recovery is OK, as there is a probability that ATS events have not yet been dispatched when the RM fails over. But we need to ensure that data such as the event time is not altered when resending events.
In org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.AttemptRecoveredTransition
RMAppImpl.isAppInFinalState returns true, hence the transition which publishes the attempt to ATS during recovery is not played.
So one option is to move BaseFinalTransition.transition outside this if-block.
But the other query I have is: during recovery, is it required to publish events to ATS at all?
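A small, self-contained sketch of the option described above (all names are illustrative, not the real RMAppAttemptImpl code): the publish is hoisted out of the final-state check so that recovery replays the event, which YARN-4392 concluded is acceptable.

```java
import java.util.ArrayList;
import java.util.List;

public class AttemptRecoverySketch {
    // Records attempt IDs whose finished event reached the publisher.
    static final List<String> published = new ArrayList<>();

    static void baseFinalTransition(String attemptId) {
        // In the real code, this is where the attempt-finished event
        // reaches the system metrics publisher / ATS.
        published.add(attemptId);
    }

    // Original shape: the publish happens only when the app is NOT already
    // in a final state, so recovered finished attempts are never replayed.
    static void recoveredTransitionOriginal(String attemptId,
                                            boolean appInFinalState) {
        if (!appInFinalState) {
            baseFinalTransition(attemptId);
        }
    }

    // Proposed shape: hoist the publish out of the if-block so recovery
    // always replays the event (duplicates are acceptable per YARN-4392).
    static void recoveredTransitionProposed(String attemptId,
                                            boolean appInFinalState) {
        baseFinalTransition(attemptId);
    }

    public static void main(String[] args) {
        recoveredTransitionOriginal("attempt_1", true); // nothing published
        recoveredTransitionProposed("attempt_2", true); // published
        System.out.println(published); // [attempt_2]
    }
}
```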