Apache Tez / TEZ-2943

Change shuffle vertex manager to use per vertex data for auto reduce and slow start


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.7.1, 0.8.2
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      Excerpts from a DAG execution log highlight the issue.
      A DAG has a vertex scope-212 with two input sources, scope-210 and scope-211, which have the following task counts and output sizes.

      scope-211:
      - Task count: 72
      - OUTPUT_BYTES: 93,975,296

      scope-210:
      - Task count: 5
      - OUTPUT_BYTES: 2,315,364,586
      

      Here is scope-212 auto reduce parallelism kicking in

      2015-11-11 19:46:28,829 [INFO] [App Shared Pool - #0] |vertexmanager.ShuffleVertexManager|: Reduce auto parallelism for vertex: scope-212 to 1 from 56 . Expected output: 101293660 based on actual output: 76299121 from 58 vertex manager events.  desiredTaskInputSize: 134217728 max slow start tasks:57.75 num sources completed:58
      
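The numbers in that log line fit a simple extrapolation. The sketch below reconstructs the arithmetic (the function name and structure are illustrative, not the actual ShuffleVertexManager code): total output is estimated from the events seen so far, pooled across all sources, and the reducer count is sized so each task reads roughly one desired-size chunk.

```python
# Sketch (reconstructed from the log values above, not the real Tez source)
# of how the vertex manager picks a new parallelism.

DESIRED_TASK_INPUT_SIZE = 134_217_728  # desiredTaskInputSize from the log (128 MB)

def estimate_parallelism(actual_output_bytes, events_received, total_source_tasks):
    """Extrapolate total source output from the events received so far,
    then size the reducer count to one desired chunk per task."""
    expected_total = actual_output_bytes * total_source_tasks // events_received
    parallelism = max(1, expected_total // DESIRED_TASK_INPUT_SIZE)
    return expected_total, parallelism

# Values from the scope-212 log line: 76,299,121 actual bytes from 58 events,
# 77 total source tasks.
expected, tasks = estimate_parallelism(76_299_121, 58, 77)
print(expected, tasks)  # 101293660 1 -- matches the log; parallelism collapses to 1
```

Because the 58 events were pooled with no regard to which vertex produced them, the estimate never saw scope-210's 2.3 GB coming.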

      Some more background on why we determined the auto parallelism and started scheduling tasks

      2015-11-11 19:46:28,829 [INFO] [App Shared Pool - #0] |vertexmanager.ShuffleVertexManager|: Scheduling 56 tasks for vertex: scope-212 with totalTasks: 56. 58 source tasks completed out of 77. SourceTaskCompletedFraction: 0.7532467 min: 0.25 max: 0.75
      
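The scheduling decision follows from the slow-start ramp implied by the logged min/max fractions. This is a sketch of the assumed formula (not the actual Tez code): tasks are released linearly between the min and max source-completion fractions, and 58/77 is already past the max.

```python
# Sketch (assumed slow-start formula, consistent with the logged
# min: 0.25 / max: 0.75 fractions) of why all 56 tasks were released at once.

def tasks_to_schedule(completed, total_sources, total_tasks,
                      min_fraction=0.25, max_fraction=0.75):
    completed_fraction = completed / total_sources
    # Linearly ramp from 0 tasks at min_fraction to all tasks at max_fraction.
    ramp = (completed_fraction - min_fraction) / (max_fraction - min_fraction)
    ramp = min(1.0, max(0.0, ramp))
    return int(ramp * total_tasks)

print(tasks_to_schedule(58, 77, 56))  # 56 -- 58/77 ≈ 0.753 exceeds max_fraction
```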

      It made this decision because 58/77 is roughly 75%. Most of the data came from scope-210, so let's see how many of the 58 completed source tasks are from scope-210.

      This is the first task attempt from scope-210

      2015-11-11 19:47:02,862 [INFO] [Dispatcher thread {Central}] |history.HistoryEventHandler|: [HISTORY][DAG:dag_1446259040496_444992_1][Event:TASK_ATTEMPT_FINISHED]: vertexName=scope-210, taskAttemptId=attempt_1446259040496_444992_1_02_000004_0, creationTime=1447271172510, allocationTime=1447271175440, startTime=1447271181036, finishTime=1447271222861, timeTaken=41825, status=SUCCEEDED, errorEnum=, diagnostics=
      

      This is almost 30 seconds after the auto-reduce parallelism kicked in. So it seems we have a serious auto-reduce parallelism bug: 1) it doesn't require 75% completion of each source vertex, and/or 2) it doesn't weight the expected bytes per source vertex.
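A hypothetical per-source breakdown of the same moment makes the problem concrete, using the vertex stats above and assuming all 58 completions came from scope-211 (consistent with scope-210's first task attempt finishing roughly 30 seconds later):

```python
# Hypothetical per-source view at the moment auto-reduce fired.
# Assumption (not from the log): all 58 completed tasks belong to scope-211.

sources = {
    # vertex: (completed_tasks, total_tasks, final OUTPUT_BYTES)
    "scope-211": (58, 72, 93_975_296),
    "scope-210": (0, 5, 2_315_364_586),
}

total_bytes = sum(out for _, _, out in sources.values())
for name, (done, total, out) in sorted(sources.items()):
    print(f"{name}: completed {done}/{total} = {done / total:.2f}, "
          f"share of output = {out / total_bytes:.2f}")
# scope-210 holds ~96% of the output but was 0% complete, while the pooled
# fraction 58/77 already looked like 75% done.
```

Under a per-vertex rule, scope-210's 0% completion would have blocked both the auto-reduce decision and the full slow-start release.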

      This means it launched only 1 task for ~2.5 GB of data instead of 16.

      This may relate to TEZ-1532 since events do not signify which source vertex or task they originate from.

      Attachments

        1. TEZ-2943.3.patch
          47 kB
          Bikas Saha
        2. TEZ-2943.2.patch
          47 kB
          Bikas Saha
        3. TEZ-2943.1.patch
          37 kB
          Bikas Saha

            People

              Assignee: Bikas Saha
              Reporter: Jonathan Turner Eagles
