[PIG-614] Reduce I/O by sharing scans of the same input datasets - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Duplicate
Affects Version/s: 0.2.0
Fix Version/s: 0.2.0
Component/s: impl
Labels: None

Description

To store different results generated from the same input dataset, we currently need to write two or more STORE clauses. These STORE clauses are translated into separate MR jobs, even though those jobs may scan the same input datasets.

For example:
The dataset 'weather' contains weather records. Each record has three parts: wind, air, and temperature. We need to process the different parts of the records.
We might write a Pig script like the one below:

weather = load 'weather.txt' as (wind, air, temperature);
wind_results = ... wind ...;
air_results = ... air ...;
temp_results = ... temperature ...;
store wind_results into 'wind.results';
store air_results into 'air.results';
store temp_results into 'temp.results';

Pig currently translates this script into three separate MR jobs that run sequentially: scan 'weather.txt', process the wind data, store the wind results; scan 'weather.txt' again, process the air data, store the air results; ...

If the input dataset is large, this is inefficient, because the same data is scanned once per STORE clause.
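
The cost difference can be illustrated outside Pig with a small sketch. This is not Pig's implementation or a proposed patch. It is a plain Python model, under stated assumptions, that counts how often the input is scanned when each output re-reads it versus when one scan feeds all three outputs. The class CountingSource, both functions, and the sample values are hypothetical and exist only for this illustration.

```python
import csv
import io

# Hypothetical sample standing in for 'weather.txt'; the values are made up.
SAMPLE = "3,1012,21\n5,1009,19\n2,1015,24\n"


class CountingSource:
    """Wraps the input text and counts how many full scans are started."""

    def __init__(self, text):
        self.text = text
        self.scans = 0

    def records(self):
        # Each call models one complete pass over the input dataset.
        self.scans += 1
        for wind, air, temperature in csv.reader(io.StringIO(self.text)):
            yield int(wind), int(air), int(temperature)


def separate_scans(source):
    """One scan per output, as described above for three STORE clauses."""
    wind = [r[0] for r in source.records()]
    air = [r[1] for r in source.records()]
    temp = [r[2] for r in source.records()]
    return wind, air, temp


def shared_scan(source):
    """A single scan whose records are fanned out to all three outputs."""
    wind, air, temp = [], [], []
    for w, a, t in source.records():
        wind.append(w)
        air.append(a)
        temp.append(t)
    return wind, air, temp


if __name__ == "__main__":
    a = CountingSource(SAMPLE)
    b = CountingSource(SAMPLE)
    # Both strategies produce the same three result sets.
    assert separate_scans(a) == shared_scan(b)
    print(a.scans, b.scans)  # prints "3 1"
```

Both functions return identical results; only the number of passes over the input differs, and that is the I/O this issue asks to reduce.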

Issue Links

duplicates

PIG-607 Utilize intermediate results instead of re-execution

Resolved

People

Assignee: Unassigned

Reporter: Sijie Guo

Votes: 0

Watchers: 0

Dates

Created: 12/Jan/09 07:30

Updated: 24/Mar/10 22:04

Resolved: 21/Jan/09 00:48