Uploaded image for project: 'Pig'
  1. Pig
  2. PIG-614

reduce io during sharing scans of the same input datasets

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 0.2.0
    • Fix Version/s: 0.2.0
    • Component/s: impl
    • Labels:
      None

      Description

      If we want to store different results that generated from the same input dataset, now we need to write two or several STORE clauses. And these STORE clauses will be translated to different mr jobs despite of these mr jobs may share scans of the same input datasets.

      for example:
      Dataset 'weather' contains the records of the weather. Each record contains three part : wind/air/tempreture. we need to process different part of the records.
      we may write a pig script as below:

      weather = load 'weather.txt' as (wind, air, tempreture);
      wind_results = ... wind ...;
      air_results = ...air...;
      temp_results = ...tempreture...;
      store wind_results into 'wind.results';
      store air_results into 'air.results';
      store temp_results into 'temp.results';

      now pig will translate this script into three different MR jobs wich run sequencely: scan 'weather.txt', process the wind data, store the wind results; scan 'weather.txt' again, process the air data, store the air results; ...

      if the input data set is large, it is not efficient.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                Unassigned
                Reporter:
                hustlmsp Sijie Guo
              • Votes:
                0 Vote for this issue
                Watchers:
                0 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: