[PIG-273] Need to optimize the ways splits are handled, both in the top level plan and in nested plans. - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.2.0
Component/s: impl
Labels: None

Description

Currently, in the new pipeline rework (see PIG-157), splits in the data flow are not handled efficiently.

In top-level plans, a split causes all of the output data to be written to HDFS and then reread by each leg of the split. This forces both a write/read round trip and a new map/reduce pass, even in cases where neither is necessary. For example, consider:

A = load 'myfile';
split A into B if $0 < 100, C if $0 >= 100;
B1 = group B by $0;
...
C1 = group C by $1;
...

In this case A will be loaded and then immediately stored again. One plan will then be executed to handle the B* part of the script, and another to handle the C* part.
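To make the desired behavior concrete, here is a minimal Python sketch of a split that routes records to both legs in a single scan, without storing and rereading the input. It is only an illustration, not Pig's implementation: the helper names `split_once` and `group_by` and the sample records are hypothetical, and it assumes the second leg is meant to group C by its second field.

```python
from collections import defaultdict


def split_once(records, predicates):
    """Route each record to every leg whose predicate accepts it, in one pass."""
    legs = {name: [] for name in predicates}
    for rec in records:  # the input is scanned exactly once
        for name, pred in predicates.items():
            if pred(rec):
                legs[name].append(rec)
    return legs


def group_by(records, col):
    """Group records by the value in column `col`."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[col]].append(rec)
    return dict(groups)


# Hypothetical sample data standing in for 'myfile'.
A = [(5, 'x'), (150, 'y'), (5, 'z'), (100, 'x')]

legs = split_once(A, {
    'B': lambda r: r[0] < 100,    # B if $0 < 100
    'C': lambda r: r[0] >= 100,   # C if $0 >= 100
})
B1 = group_by(legs['B'], 0)  # group B by $0
C1 = group_by(legs['C'], 1)  # group C by $1
```

In a real map/reduce setting the two groupings use different keys, so whether both legs can share a single job depends on the plan; the sketch only shows that the split itself does not require an intermediate store.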

In nested plans, each projection of the generate is computed separately, even if they share common steps in the plan. For example:

B = group A by $0;
C = foreach B {
C1 = distinct $1;
C2 = filter C1 by $1 > 0;
generate group, COUNT(C1), COUNT(C2);
}

That will currently be executed with two nested plans, distinct->COUNT(C1) and distinct->filter->COUNT(C2). The same distinct will be computed twice. Ideally we would like to compute the distinct once and then split the output.
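The sharing described above can be sketched in Python as follows. This is a hedged illustration of the idea, not Pig's nested-plan code: `group_by_first`, `counts_with_shared_distinct`, and the sample tuples are hypothetical, and each bag is treated as the list of full tuples in a group.

```python
from collections import defaultdict


def group_by_first(records):
    """B = group A by $0: map each key to the bag of tuples sharing it."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[0]].append(rec)
    return groups


def counts_with_shared_distinct(bag):
    """Compute COUNT(C1) and COUNT(C2) while evaluating the distinct once."""
    c1 = set(bag)                      # C1 = distinct, computed a single time
    c2 = [r for r in c1 if r[1] > 0]   # C2 = filter C1 by $1 > 0, reusing c1
    return len(c1), len(c2)


# Hypothetical sample data.
A = [(1, 3), (1, 3), (1, -2), (2, 0), (2, 7)]

result = {key: counts_with_shared_distinct(bag)
          for key, bag in group_by_first(A).items()}
```

The output of the distinct feeds both the first count and the filter, which is the "compute once, then split" shape the description asks for.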

I suspect that optimizing the nested plan is more important, because shared steps arise there in more situations.

People

Assignee: Unassigned

Reporter: Alan Gates

Votes: 0

Watchers: 3

Dates

Created: 17/Jun/08 15:48

Updated: 02/Feb/15 17:47

Resolved: 11/Feb/09 00:25
