Description
To store several results derived from the same input dataset, we currently have to write two or more STORE clauses. Each STORE clause is translated into a separate MapReduce (MR) job, even though these jobs may all scan the same input dataset.
For example:
The dataset 'weather' contains weather records. Each record has three parts: wind, air, and temperature. We need to process each part of the records separately.
We might write a Pig script like this:
weather = load 'weather.txt' as (wind, air, temperature);
wind_results = ... wind ...;
air_results = ...air...;
temp_results = ...temperature...;
store wind_results into 'wind.results';
store air_results into 'air.results';
store temp_results into 'temp.results';
Pig currently translates this script into three separate MR jobs that run sequentially: scan 'weather.txt', process the wind data, store the wind results; scan 'weather.txt' again, process the air data, store the air results; ...
If the input dataset is large, this is inefficient, because the same input is scanned once per STORE clause.
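To make the cost difference concrete, here is a minimal Python sketch of the two strategies. It only illustrates the idea and is not how Pig plans or runs jobs: `CountingSource`, `run_separate_jobs`, and `run_shared_scan` are hypothetical names, the comma-separated record layout is an assumption, and the "processing" is reduced to picking one field per record.

```python
import io

FIELDS = ("wind", "air", "temperature")


class CountingSource:
    """Hypothetical stand-in for 'weather.txt' that counts full scans."""

    def __init__(self, text):
        self.text = text
        self.scans = 0

    def scan(self):
        # Each call represents one full pass over the input dataset.
        self.scans += 1
        for line in io.StringIO(self.text):
            wind, air, temperature = line.strip().split(",")
            yield {"wind": wind, "air": air, "temperature": temperature}


def run_separate_jobs(source, fields):
    # Current behaviour described above: one scan per stored result.
    return {f: [rec[f] for rec in source.scan()] for f in fields}


def run_shared_scan(source, fields):
    # Requested behaviour: one scan feeds every stored result.
    results = {f: [] for f in fields}
    for rec in source.scan():
        for f in fields:
            results[f].append(rec[f])
    return results


if __name__ == "__main__":
    data = "1,2,3\n4,5,6\n"
    separate, shared = CountingSource(data), CountingSource(data)
    same = run_separate_jobs(separate, FIELDS) == run_shared_scan(shared, FIELDS)
    print(same, separate.scans, shared.scans)
```

Both functions produce the same per-field results; the difference is only in how many times the input is read, which is the cost that grows with the size of the dataset.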
Issue Links
- duplicates PIG-607 Utilize intermediate results instead of re-execution (Resolved)