[YARN-3916] DrainDispatcher#await should wait till event has been completely handled - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Duplicate
Affects Version/s: 2.7.0
Fix Version/s: None
Component/s: None
Labels: None

Target Version/s: 2.7.2

Description

DrainDispatcher#await should wait until the event has been completely handled.
Currently it only checks whether the event queue has become empty.

In many tests we check for a state change directly after calling await.
Sometimes the state has not changed by the time we check it, because the event has been taken off the queue but not yet completely handled.

This is causing test failures such as YARN-3909 and YARN-3910, and may cause other test failures as well.
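The window described above can be reproduced deterministically with a small stand-alone sketch. This is not Hadoop code: the class `QueueOnlyAwaitSketch`, the method `observeMidHandling`, and the latches that hold the handler mid-event are all hypothetical names introduced for illustration.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a dispatcher thread; not the Hadoop DrainDispatcher.
final class QueueOnlyAwaitSketch {

  /** Returns {queueWasEmpty, stateWasUpdated} as observed while the handler is mid-event. */
  static boolean[] observeMidHandling() {
    BlockingQueue<String> eventQueue = new LinkedBlockingQueue<>();
    AtomicBoolean stateChanged = new AtomicBoolean(false);
    CountDownLatch taken = new CountDownLatch(1);
    CountDownLatch mayFinish = new CountDownLatch(1);

    Thread handler = new Thread(() -> {
      try {
        eventQueue.take();       // the queue becomes empty here
        taken.countDown();
        mayFinish.await();       // simulates a handler that is still working
        stateChanged.set(true);  // the state transition a test would check for
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    handler.start();

    try {
      eventQueue.put("event");
      taken.await();
      // An await() that only checks the queue would return at this point.
      boolean queueEmpty = eventQueue.isEmpty();
      boolean updated = stateChanged.get();
      mayFinish.countDown();
      handler.join();
      return new boolean[] {queueEmpty, updated};
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }
}
```

Here `observeMidHandling()` returns `{true, false}`: the queue is already empty, but the state a test would assert on has not changed yet.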

Attachments

YARN-3916.01.patch	11/Jul/15 22:41	3 kB	Varun Saxena
YARN-3916.02.patch	12/Jul/15 17:39	3 kB	Varun Saxena

Issue Links

is duplicated by

YARN-3909 TestAMRMRPCNodeUpdates#testAMRMUnusableNodes fails on trunk (Resolved)
YARN-3910 TestRMAppTransitions#testAppAcceptedAttemptKilled fails on trunk (Resolved)
YARN-3913 TestResourceTrackerService#testReconnectNode fails on trunk (Resolved)
YARN-3918 TestApplicationCleanup#testContainerCleanup occasionally fails on trunk (Resolved)
YARN-3878 AsyncDispatcher can hang while stopping if it is configured for draining events on stop (Closed)

relates to

YARN-3909 TestAMRMRPCNodeUpdates#testAMRMUnusableNodes fails on trunk (Resolved)
YARN-3910 TestRMAppTransitions#testAppAcceptedAttemptKilled fails on trunk (Resolved)
YARN-3878 AsyncDispatcher can hang while stopping if it is configured for draining events on stop (Closed)

Activity

Varun Saxena added a comment - 11/Jul/15 11:32

In YARN-3878 we introduced a check for whether the event queue is empty in DrainDispatcher#await.
Even before YARN-3878, the code was essentially doing the same thing, but through a volatile flag. That may have failed sometimes as well.
But changes to the volatile flag were not seen by the other thread as quickly as an empty event queue is. That delay gave the async dispatcher time to handle the event, so these tests were mostly not failing.

Ideally we should check whether the event has been handled, in addition to the event queue being empty.

Varun Saxena added a comment - 11/Jul/15 21:58

The pre-YARN-3878 code would actually have worked for DrainDispatcher.

Varun Saxena added a comment - 11/Jul/15 22:06

Updated patch

Hadoop QA added a comment - 11/Jul/15 23:30

+1 overall

Vote	Subsystem	Runtime	Comment
0	pre-patch	16m 12s	Pre-patch trunk compilation is healthy.
+1	@author	0m 0s	The patch does not contain any @author tags.
+1	tests included	0m 0s	The patch appears to include 1 new or modified test files.
+1	javac	7m 41s	There were no new javac warning messages.
+1	javadoc	9m 41s	There were no new javadoc warning messages.
+1	release audit	0m 22s	The applied patch does not increase the total number of release audit warnings.
+1	checkstyle	0m 56s	There were no new checkstyle issues.
+1	whitespace	0m 0s	The patch has no lines that end in whitespace.
+1	install	1m 21s	mvn install still works.
+1	eclipse:eclipse	0m 33s	The patch built with eclipse:eclipse.
+1	findbugs	1m 34s	The patch does not introduce any new Findbugs (version 3.0.0) warnings.
+1	yarn tests	1m 56s	Tests passed in hadoop-yarn-common.
		40m 19s

Subsystem	Report/Notes
Patch URL	http://issues.apache.org/jira/secure/attachment/12744910/YARN-3916.01.patch
Optional Tests	javadoc javac unit findbugs checkstyle
git revision	trunk / 1df39c1
hadoop-yarn-common test log	https://builds.apache.org/job/PreCommit-YARN-Build/8511/artifact/patchprocess/testrun_hadoop-yarn-common.txt
Test Results	https://builds.apache.org/job/PreCommit-YARN-Build/8511/testReport/
Java	1.7.0_55
uname	Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
Console output	https://builds.apache.org/job/PreCommit-YARN-Build/8511/console

This message was automatically generated.

Wangda Tan added a comment - 12/Jul/15 17:28

Thanks varun_saxena for investigating this problem. I think the patch generally looks good. I suggest adding a comment explaining why the processingEvents variable is needed.

Varun Saxena added a comment - 12/Jul/15 17:43

Thanks leftnoteasy for reviewing the patch.
I have added a comment as you suggested and uploaded a new patch.

Hadoop QA added a comment - 12/Jul/15 18:20

+1 overall

Vote	Subsystem	Runtime	Comment
0	pre-patch	16m 13s	Pre-patch trunk compilation is healthy.
+1	@author	0m 0s	The patch does not contain any @author tags.
+1	tests included	0m 0s	The patch appears to include 1 new or modified test files.
+1	javac	7m 42s	There were no new javac warning messages.
+1	javadoc	9m 35s	There were no new javadoc warning messages.
+1	release audit	0m 22s	The applied patch does not increase the total number of release audit warnings.
+1	checkstyle	0m 56s	There were no new checkstyle issues.
+1	whitespace	0m 0s	The patch has no lines that end in whitespace.
+1	install	1m 21s	mvn install still works.
+1	eclipse:eclipse	0m 34s	The patch built with eclipse:eclipse.
+1	findbugs	1m 34s	The patch does not introduce any new Findbugs (version 3.0.0) warnings.
+1	yarn tests	1m 56s	Tests passed in hadoop-yarn-common.
		40m 16s

Subsystem	Report/Notes
Patch URL	http://issues.apache.org/jira/secure/attachment/12744944/YARN-3916.02.patch
Optional Tests	javadoc javac unit findbugs checkstyle
git revision	trunk / d7319de
hadoop-yarn-common test log	https://builds.apache.org/job/PreCommit-YARN-Build/8513/artifact/patchprocess/testrun_hadoop-yarn-common.txt
Test Results	https://builds.apache.org/job/PreCommit-YARN-Build/8513/testReport/
Java	1.7.0_55
uname	Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
Console output	https://builds.apache.org/job/PreCommit-YARN-Build/8513/console

This message was automatically generated.

Wangda Tan added a comment - 12/Jul/15 21:40

varun_saxena,
Thanks for updating, but I think the comment may not be accurate enough, if I understand correctly: the flag == true when the dispatcher is processing events OR the dispatcher has pending events. Maybe renaming it to something like "isIdle" would be more accurate.

Thoughts?

Karthik Kambatla added a comment - 13/Jul/15 06:23

Skimmed through the patch. Would it be simpler to have a boolean flag (eventBeingProcessed) that is turned on only while an event is being handled, and have isProcessingEvents be !eventQueue.isEmpty() || eventBeingProcessed?

Rohith Sharma K S added a comment - 13/Jul/15 06:40

How about keeping only the earlier variable, drained? I think that name is more appropriate.

Varun Saxena added a comment - 13/Jul/15 06:54

I thought of this first.
But the issue here is that LinkedBlockingQueue#take is a blocking call. It won't return if there are no pending events.
If I set the flag before this call, it is possible that we are stuck there forever.
If I set it afterwards, there can be a minor race wherein the dispatcher thread has taken the event from the queue, making it empty, but has not yet set the flag to true. In this case, we may wrongly judge that no events are being processed.
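One way to sidestep both problems, sketched here under assumptions and not taken from any patch on this issue: let the thread that enqueues an event clear a `drained` flag under a lock, and let the dispatcher thread recompute the flag at the top of its loop, a point it only reaches once the previous event has been fully handled. The class `DrainedFlagSketch` and its methods `post`, `await`, and `stop` are hypothetical names.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Hypothetical mini dispatcher for illustration; names are not Hadoop's.
final class DrainedFlagSketch<E> {
  private final BlockingQueue<E> eventQueue = new LinkedBlockingQueue<>();
  private final Object mutex = new Object();
  private final Consumer<E> handler;
  private volatile boolean drained = true;
  private final Thread loop;

  DrainedFlagSketch(Consumer<E> handler) {
    this.handler = handler;
    this.loop = new Thread(this::run, "dispatcher-sketch");
    this.loop.setDaemon(true);
    this.loop.start();
  }

  private void run() {
    while (!Thread.currentThread().isInterrupted()) {
      synchronized (mutex) {
        // Reached only after the previous event has been fully handled.
        drained = eventQueue.isEmpty();
      }
      E event;
      try {
        event = eventQueue.take(); // may block; the flag is not touched here
      } catch (InterruptedException e) {
        return;
      }
      handler.accept(event);
    }
  }

  /** Enqueues an event and marks the dispatcher as not drained, atomically. */
  void post(E event) {
    synchronized (mutex) {
      eventQueue.add(event); // the queue is unbounded, so add does not block
      drained = false;
    }
  }

  /** Spins until every event posted so far has been completely handled. */
  void await() {
    while (!drained) {
      Thread.yield();
    }
  }

  void stop() {
    loop.interrupt();
  }
}
```

Because the flag is cleared at enqueue time rather than around take, there is no moment at which the queue is empty, an event is in flight, and the flag still claims the dispatcher is idle. The lock keeps the dispatcher's "queue is empty" observation from overwriting a concurrent post.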

Varun Saxena added a comment - 13/Jul/15 07:21

Yeah, the variable can be named drained, since the dispatcher is named DrainDispatcher as well.
I can explain in a comment what it means.

Jian He added a comment - 13/Jul/15 07:23

Won't draining events on serviceStop have the same problem?

I actually wondered why that flag was added in the first place; this should be the reason.
I think we can revert the change made in YARN-3878 and fix the YARN-3878 problem properly?

Varun Saxena added a comment - 13/Jul/15 07:36

"Won't draining events on serviceStop have the same problem?"

Yes, you are correct. It will have the same problem. We may judge that the queue is empty and go on to interrupt the thread while an event is still being handled.

"I think we can revert the change of YARN-3878 and fix the problem of YARN-3878 properly?"

Yeah, let's do that. I think when InterruptedException is thrown on put (the issue in YARN-3878), we can reset the flag to false if the queue is empty.
Thoughts?
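A minimal sketch of the "reset on InterruptedException" idea, with assumptions stated up front: the flag is modelled here as a `drained` flag that is true when nothing is pending, so the reset recomputes it from the queue. The polarity of the flag in the actual patch is not stated in this thread, and the class `ResetOnInterruptSketch` with its methods is hypothetical.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical event handler for illustration; not the Hadoop AsyncDispatcher.
final class ResetOnInterruptSketch {
  private final BlockingQueue<String> eventQueue = new LinkedBlockingQueue<>();
  private volatile boolean drained = true;

  void handle(String event) {
    drained = false; // an event is about to become pending
    try {
      eventQueue.put(event);
    } catch (InterruptedException e) {
      // The event was never enqueued. Recompute the flag so that a
      // stop routine waiting for "drained" does not wait forever.
      drained = eventQueue.isEmpty();
      Thread.currentThread().interrupt(); // preserve the interrupt status
      throw new IllegalStateException("interrupted while enqueueing", e);
    }
  }

  boolean isDrained() {
    return drained;
  }
}
```

Without the recomputation in the catch block, an interrupted put would leave the flag saying "not drained" even though the queue is empty, which matches the hang on stop described in the YARN-3878 title.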

Varun Saxena added a comment - 13/Jul/15 15:14

jianhe, I have added an addendum patch on YARN-3878.
It adds back the previous drained flag, resets it on InterruptedException, and keeps the bits of YARN-3878 that were required.

Jian He added a comment - 13/Jul/15 21:28

Thanks varun_saxena! I will re-open YARN-3878 and close this as a dup of that.
I will look at YARN-3878.

People

Assignee: Varun Saxena

Reporter: Varun Saxena

Votes: 0

Watchers: 7

Dates

Created: 11/Jul/15 11:18

Updated: 30/Oct/15 06:03

Resolved: 13/Jul/15 21:28

Hadoop YARN
