Apache Tez / TEZ-1494

DAG hangs waiting for ShuffleManager.getNextInput()

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.5.1
    • Component/s: None
    • Hadoop Flags: Reviewed

    Description

      Attaching the DAG and the stack trace of the hung process. An illustrative sketch of the blocking pattern follows the trace below.

      Thread 30071: (state = BLOCKED)

      • sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
      • java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Interpreted frame)
      • java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=2043 (Interpreted frame)
      • java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 (Interpreted frame)
      • org.apache.tez.runtime.library.shuffle.common.impl.ShuffleManager.getNextInput() @bci=67, line=610 (Interpreted frame)
      • org.apache.tez.runtime.library.common.readers.UnorderedKVReader.moveToNextInput() @bci=26, line=176 (Interpreted frame)
      • org.apache.tez.runtime.library.common.readers.UnorderedKVReader.next() @bci=30, line=117 (Interpreted frame)
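
      For context, the trace above shows getNextInput() parked inside LinkedBlockingQueue.take(). The sketch below is plain Java with hypothetical names (completedInputs, numExpectedInputs); it is not the actual ShuffleManager code, only an illustration of how a reader hangs forever once the expected-input count no longer matches what the producer will ever deliver:

        import java.util.concurrent.BlockingQueue;
        import java.util.concurrent.LinkedBlockingQueue;

        // Illustrative sketch only; field and method names are hypothetical.
        class InputReaderSketch {
          private final BlockingQueue<Object> completedInputs = new LinkedBlockingQueue<>();
          private final int numExpectedInputs;   // fixed when the consumer task started
          private int numDeliveredInputs = 0;

          InputReaderSketch(int numExpectedInputs) {
            this.numExpectedInputs = numExpectedInputs;
          }

          Object getNextInput() throws InterruptedException {
            if (numDeliveredInputs >= numExpectedInputs) {
              return null; // all expected inputs already handed out
            }
            // If the upstream vertex later reduced its parallelism, fewer inputs than
            // numExpectedInputs are ever enqueued and this take() blocks forever;
            // that is the state of thread 30071 in the attached stack trace.
            Object next = completedInputs.take();
            numDeliveredInputs++;
            return next;
          }
        }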

      Attachments

        1. TEZ-1494.1.patch
          7 kB
          Rajesh Balamohan
        2. TEZ-1494.2.patch
          11 kB
          Rajesh Balamohan
        3. TEZ-1494.3.patch
          26 kB
          Rajesh Balamohan
        4. TEZ-1494.4.patch
          29 kB
          Rajesh Balamohan
        5. TEZ-1494.5.patch
          29 kB
          Rajesh Balamohan
        6. TEZ-1494-DAG.dot
          6 kB
          Rajesh Balamohan

        Activity

          hitesh Hitesh Shah added a comment -

          Setting the fix version to the lowest version the patch was committed to.


          rajesh.balamohan Rajesh Balamohan added a comment -

          Agreed. Thanks bikassaha. Committed to branch-0.5.

          commit 176b12cde8421582e05afbf1cb5e1b46f3cf0d38
          Author: Rajesh Balamohan <rbalamohan@apache.org>
          Date: Thu Sep 11 03:11:48 2014 +0530

          TEZ-1494. DAG hangs waiting for ShuffleManager.getNextInput() (Rajesh Balamohan)

          bikassaha Bikas Saha added a comment -

          lgtm. Let's get this in for 0.5.1 and continue to investigate these corner cases. I don't think we are done with these issues yet. The shuffle case may still have issues, as well as the corner case you mentioned in TEZ-1522. We can start working on integrating the notification mechanism with the VMs and use that for 0.6.0.


          rajesh.balamohan Rajesh Balamohan added a comment -

          bikassaha Got it; fixed it in the latest patch.

          sseth Siddharth Seth added a comment -

          Since the fix is within plugins, we'll likely need to inform Hive and Pig to make similar changes - if they're required. Hive, for example, uses a VertexManagerPlugin to process RootInputs.

          bikassaha Bikas Saha added a comment -

          Unless I am following the flow incorrectly, onSourceTaskCompleted() calls scheduleTasks(), which calls canScheduleTasks() and then schedules all tasks. Since onSourceTaskCompleted() can be called multiple times, scheduleTasks() will be called multiple times, and thus we can end up scheduling the tasks multiple times. canScheduleTasks()/scheduleTasks() don't seem to have any checks against having already scheduled everything. I am looking at the .4 patch. Am I missing something?

          • "Boolean taskIsFinished[]" was added earlier to track duplicate completions. But, in this case it is irrelevant. So retaining just numFinishedTasks in the latest patch. Removed numTasks as well in latest patch.
          • Multiple schedulings are prevented via canScheduleTasks() which is invoked within scheduleTasks()
          • Removed ImmediateStartVertexManager specific checks from testcases. Added CustomEdge in the testcase.
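
          A minimal sketch of the guard described above, assuming a VertexManagerPlugin-style callback; the class name and the exact scheduling condition are hypothetical and only illustrate invoking canScheduleTasks() from scheduleTasks(), not the actual patch:

            // Hypothetical sketch: schedule tasks exactly once even though
            // onSourceTaskCompleted() can be invoked multiple times.
            class OnceOnlySchedulerSketch {
              private final int numSourceTasks;
              private int numFinishedSourceTasks = 0;
              private boolean tasksScheduled = false;  // single flag, no per-task array

              OnceOnlySchedulerSketch(int numSourceTasks) {
                this.numSourceTasks = numSourceTasks;
              }

              synchronized void onSourceTaskCompleted() {
                numFinishedSourceTasks++;
                scheduleTasks();
              }

              private void scheduleTasks() {
                if (!canScheduleTasks()) {
                  return;  // either nothing has finished yet, or we already scheduled
                }
                tasksScheduled = true;
                // ... hand the vertex's tasks to the framework for scheduling ...
              }

              private boolean canScheduleTasks() {
                // Only the first completion matters; repeat notifications are ignored.
                return !tasksScheduled && numFinishedSourceTasks > 0;
              }
            }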
          rajesh.balamohan Rajesh Balamohan added a comment - "Boolean taskIsFinished[]" was added earlier to track duplicate completions. But, in this case it is irrelevant. So retaining just numFinishedTasks in the latest patch. Removed numTasks as well in latest patch. Multiple schedulings are prevented via canScheduleTasks() which is invoked within scheduleTasks() Removed ImmediateStartVertexManager specific checks from testcases. Added CustomEdge in the testcase.
          bikassaha Bikas Saha added a comment -

          From what I see, both of these can be replaced by a single boolean, right? We are only interested in 1 completion.

          +    int numFinishedTasks;
          +    Boolean taskIsFinished[];
          +

          Not sure how we are preventing multiple schedulings of the tasks because scheduleTasks() is now being called on every onSourceTaskCompleted().

          This code should probably be removed since we are trying to test the behavior and not the exact internal impl. The impl could change but the behavior should not. Right? This would also allow us to make this method private.

          +    assertTrue(((ImmediateStartVertexManager)m5.getVertexManager().getPlugin()).canScheduleTasks() == false);

          Looks like the test is only covering the ImmediateStartVertexManager case. Adding a custom edge between M7 and a new vertex (with the new vertex having a RootInputVertexManager) would cover the remaining cases. If that gets hard to write then we should at least add M7 to a new vertex with a custom edge (no RootInputVertexManager).

          bikassaha Bikas Saha added a comment -

          We may put this patch in for 0.5.1 as a bug fix. I plan to work on TEZ-1547 and change all existing VMs where needed. That would go into 0.6.0 as it's a major change.
          I will have comments for this jira shortly.


          rajesh.balamohan Rajesh Balamohan added a comment -

          Thanks Bikas. I believe TEZ-1447 is specific to InputInitializers, which cannot be used directly in this case; TEZ-1547 would be relevant here. Performance would definitely be better if we rely on the vertex-started notification event (as Tez would be able to start scheduling instead of waiting for every task to be completed from each source vertex). Should we wait for TEZ-1547 to be completed, or should we proceed with the current patch and refactor to the event-based approach when TEZ-1547 is done? Thoughts?

          bikassaha Bikas Saha added a comment -

          Right TEZ-1447.

          sseth Siddharth Seth added a comment -

          rajesh.balamohan - the corner case is what I was referring to. If we're not addressing it in this jira, we should create a separate one to track the issue. I'm not sure if Hive/Pig set the source fractions when they configure a ShuffleEdge.


          rajesh.balamohan Rajesh Balamohan added a comment -

          bikassaha, sorry, are you referring to TEZ-1447 or some other ticket?

          rajesh.balamohan Rajesh Balamohan added a comment - - edited

          sseth
          Short answer: yes, but it's a remote chance.

          • If the min/max fraction is set to 0.0 at the global level (i.e., via tez.shuffle-vertex-manager.min-src-fraction=0.0 in tez-site.xml), ShuffleVertexManager wouldn't change the parallelism as per the current logic, so this wouldn't be a problem.
          • With the default min/max (0.25-0.75) or any non-zero range, there won't be a problem, as ShuffleVertexManager would wait for some source tasks to finish.
          • Corner case (see the configuration sketch below):
            - Set one vertex's min/max to (0.25/0.75) and the downstream vertex's to (0.0/0.0).
            - Since the downstream vertex is set to 0.0, its tasks would start immediately without knowing about changes in the upstream vertex.
            - If the upstream vertex changes parallelism, the downstream vertex wouldn't change its parallelism.
            - However, this requires individual min/max settings for each vertex.
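
          As an illustration of the global vs. per-vertex settings above, here is a minimal sketch using Hadoop's Configuration; the min-src-fraction key is the one quoted above, and the max-src-fraction key name is assumed to mirror it:

            import org.apache.hadoop.conf.Configuration;

            public class SlowStartFractionSketch {
              public static void main(String[] args) {
                // Global defaults, equivalent to entries in tez-site.xml (0.25/0.75 range).
                Configuration conf = new Configuration();
                conf.setFloat("tez.shuffle-vertex-manager.min-src-fraction", 0.25f);
                conf.setFloat("tez.shuffle-vertex-manager.max-src-fraction", 0.75f);

                // Corner case from the list above: a per-vertex override of 0.0/0.0 makes
                // the downstream vertex schedule immediately, so it never observes a later
                // parallelism change in its upstream vertex.
                Configuration downstreamConf = new Configuration(conf);
                downstreamConf.setFloat("tez.shuffle-vertex-manager.min-src-fraction", 0.0f);
                downstreamConf.setFloat("tez.shuffle-vertex-manager.max-src-fraction", 0.0f);

                System.out.println(
                    downstreamConf.getFloat("tez.shuffle-vertex-manager.min-src-fraction", -1f));
              }
            }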
          sseth Siddharth Seth added a comment -

          rajesh.balamohan - can we run into the same case on Shuffle edges as well, if the min and max fraction are set to 0? A simple M - R - R, with a shuffle connection between all vertices.

          bikassaha Bikas Saha added a comment -

          What do you think of the comment above about using the pubsub mechanism? That may result in faster scheduling than waiting for source task completion. I am holding off review for this until I hear your views.

          rajesh.balamohan Rajesh Balamohan added a comment - - edited

          Adding test cases to the patch & addressing review comments.

          bikassaha Bikas Saha added a comment - - edited

          Alternatively we could wait for TEZ-1494. Then we can simply register for the vertex-started-running notification and schedule all vertices once that notification has been received. That would be much simpler than having to monitor for essentially the same thing, and faster since we don't have to wait for tasks to complete before we schedule tasks. However, for that to work the vertex-started-running notification needs to come when the vertex actually starts running (schedules tasks) instead of when the vertex state machine enters the running state. Or maybe add a new notification saying the vertex started scheduling.

          rajesh.balamohan Rajesh Balamohan added a comment -

          Review board link: https://reviews.apache.org/r/25287/diff/#

          rajesh.balamohan Rajesh Balamohan added a comment -

          If the approach listed in the latest patch is fine, do we really need ImmediateStartVertexManager?


          rajesh.balamohan Rajesh Balamohan added a comment -

          bikassaha - Can you please review?

          bikassaha Bikas Saha added a comment -

          problem wherein downstream vertex connected via broadcast edge is not updated when the parallelism is changed.

          This is tracked by TEZ-1059.

          sseth Siddharth Seth added a comment -

          We end up initializing a vertex when all of the following are met: 1) the initializer is complete, 2) edges are set up, 3) parallelism is not -1. All three conditions would be valid for Reducer3, so it would end up allowing Map5 (the dependent vertex) to start.
          We currently have no way of knowing whether a Vertex will change parallelism - and whether we should block for such an operation. Alternatively, we'll have to end up updating the downstream tasks with the new parallelism information - which may be a better way to deal with this, since parallelism could potentially change multiple times at a later point.
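
          A minimal sketch of the three-part gate described above; the class and field names are hypothetical and not the actual Vertex implementation:

            // Hypothetical sketch of the initialization gate described in this comment.
            class VertexInitGateSketch {
              boolean initializerComplete;   // 1) initializer is complete
              boolean edgesConfigured;       // 2) edges are set up
              int parallelism = -1;          // 3) parallelism is not -1

              boolean readyToInitialize() {
                // A vertex like Reducer3 can pass all three checks and let a dependent
                // vertex (Map5) start, even though its parallelism may still change later.
                return initializerComplete && edgesConfigured && parallelism != -1;
              }
            }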


          rajesh.balamohan Rajesh Balamohan added a comment -

          Added a small testcase in https://github.com/rajeshbalamohan/tez-1494 which can be run from a local VM to reproduce the issue. With -Dtez.shuffle-vertex-manager.enable.auto-parallel=false, the DAG would succeed. Initially I thought it was due to slow-start kicking in too early, but it appears to be a problem wherein a downstream vertex connected via a broadcast edge is not updated when the parallelism is changed.
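
          For reference, a minimal sketch of the workaround flag mentioned above, set programmatically on a Hadoop Configuration (the key name is taken verbatim from this comment; the -D form passes the same property on the command line):

            import org.apache.hadoop.conf.Configuration;

            public class DisableAutoParallelSketch {
              public static void main(String[] args) {
                // Equivalent of -Dtez.shuffle-vertex-manager.enable.auto-parallel=false:
                // Reducer 3 keeps its original parallelism, so the broadcast consumer
                // (Map 5) no longer waits on inputs that will never arrive.
                Configuration conf = new Configuration();
                conf.setBoolean("tez.shuffle-vertex-manager.enable.auto-parallel", false);
                System.out.println(conf.get("tez.shuffle-vertex-manager.enable.auto-parallel"));
              }
            }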


          rajesh.balamohan Rajesh Balamohan added a comment -

          Issue happens when auto parallelism is enabled:

          • Reducer 3 starts with 2 tasks
          • Map 5 (has 1 task and depends on Reducer 3) starts before Reducer 3
          • Reducer 3 alters parallelism from 2 to 1
          • Map 5 keeps waiting for inputs from 2 tasks of Reducer 3.
          hitesh Hitesh Shah added a comment -

          Should this be a blocker for 0.5.0?

          sseth Siddharth Seth added a comment -

          rajesh.balamohan - have you investigated this any further? Were all the DataMovementEvents received? Was task retry in play? etc.


          rajesh.balamohan Rajesh Balamohan added a comment -

          The original bug was reported against 0.6.0-SNAPSHOT. I just retried with 0.5.0-rc1 and it's present in that version as well.

          hitesh Hitesh Shah added a comment -

          rajesh.balamohan Is this an issue present in the 0.5.0 RC?


          People

            Assignee: Rajesh Balamohan
            Reporter: Rajesh Balamohan
            Votes: 0
            Watchers: 5
