[PIG-3835] Improve performance of union - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: tez-branch
Fix Version/s: tez-branch
Component/s: tez
Labels: None
Hadoop Flags: Reviewed

Description

PIG-3743 implements union using VertexGroup, but there are a couple of optimizations that we can apply to it.

Union followed by store
Union is a blocking operator, meaning that a new vertex is added for its succeeding operators. But if the succeeding vertex contains only a single store, MROutput could be attached directly to the VertexGroup instead of adding a new vertex for it. Each union source vertex would then write directly to the destination, which should be faster.
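The difference in plan shape can be sketched with a toy model. This is only an illustration, not Pig or Tez code: the `Plan` class, the two builder functions, and the vertex names are all hypothetical, and the model tracks nothing beyond which vertices exist and which of them write to the destination.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    """Toy DAG: vertex names, directed edges, and the vertices that write output."""
    vertices: list = field(default_factory=list)
    edges: list = field(default_factory=list)    # (source, destination) pairs
    writers: list = field(default_factory=list)  # vertices that write to the store


def plan_with_store_vertex(sources):
    """Unoptimized shape: every union source feeds one extra vertex that stores."""
    plan = Plan(vertices=list(sources) + ["store_vertex"])
    plan.edges = [(src, "store_vertex") for src in sources]
    plan.writers = ["store_vertex"]
    return plan


def plan_with_group_output(sources):
    """Optimized shape: the output is attached to the group of sources,
    so each source writes to the destination itself and no vertex is added."""
    return Plan(vertices=list(sources), edges=[], writers=list(sources))


if __name__ == "__main__":
    before = plan_with_store_vertex(["load_a", "load_b"])
    after = plan_with_group_output(["load_a", "load_b"])
    print(len(before.vertices), len(before.edges), before.writers)
    print(len(after.vertices), len(after.edges), after.writers)
```

In this model the optimized plan has one vertex fewer and no edges carrying records between vertices, which is the saving the description is after.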

Replace POLocalRearrangeTez with POValueOutputTez
Union uses POLocalRearrange with the whole record set as the key. But since union only needs to partition records evenly across tasks, it might make more sense to use POValueOutputTez with a round-robin (RR) partitioner instead.
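To make the partitioning argument concrete, here is a minimal sketch comparing the two strategies. It is not the Pig or Tez implementation: both functions are hypothetical, records are plain strings, and `zlib.crc32` stands in for whatever hash a real key-based partitioner would use.

```python
import zlib


def hash_partition(records, num_partitions):
    """Assign each record by a hash of the whole record (record used as key)."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        # crc32 is used only to get a hash that is stable across runs.
        index = zlib.crc32(record.encode("utf-8")) % num_partitions
        partitions[index].append(record)
    return partitions


def round_robin_partition(records, num_partitions):
    """Assign records to partitions in rotation, ignoring their contents."""
    partitions = [[] for _ in range(num_partitions)]
    for position, record in enumerate(records):
        partitions[position % num_partitions].append(record)
    return partitions


if __name__ == "__main__":
    records = ["a"] * 6 + ["b", "c"]
    print([len(p) for p in hash_partition(records, 4)])
    print([len(p) for p in round_robin_partition(records, 4)])
```

With duplicate records, the key-based version sends all copies to the same partition, while the round-robin version spreads them evenly and never has to inspect or hash the record.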

Attachments

PIG-3835-2.patch, 01/Apr/14 06:25, 151 kB, Rohini Palaniswamy
PIG-3835-3.patch, 01/Apr/14 19:34, 152 kB, Rohini Palaniswamy
PIG-3835-addendum-1.patch, 01/Apr/14 23:17, 9 kB, Rohini Palaniswamy
PIG-3835-Initial-1.patch, 31/Mar/14 09:11, 121 kB, Rohini Palaniswamy

Issue Links

is related to: PIG-3743 Use VertexGroup and Alias vertex for union (Closed)

relates to: PIG-3855 Turn on UnionOptimizer by default and add new e2e tests for union (Closed)

requires: TEZ-1003 Need a input that merges multiple ShuffleMergedInput from VertexGroup (Closed)

People

Assignee: Rohini Palaniswamy

Reporter: Cheolsoo Park

Votes: 0

Watchers: 3

Dates

Created: 25/Mar/14 22:26

Updated: 21/Nov/14 05:59

Resolved: 02/Apr/14 14:30