Pig / PIG-614

Reduce I/O by sharing scans of the same input datasets


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 0.2.0
    • Fix Version/s: 0.2.0
    • Component/s: impl
    • Labels: None

Description

    If we want to store different results generated from the same input dataset, we currently have to write two or more STORE clauses. Each STORE clause is translated into a separate MR job, even though these MR jobs could share scans of the same input dataset.

    For example:
    The dataset 'weather' contains weather records. Each record has three parts: wind, air, and temperature. We need to process each part of the records separately.
    We might write a Pig script as below:

        weather = load 'weather.txt' as (wind, air, temperature);
        wind_results = ... wind ...;
        air_results = ... air ...;
        temp_results = ... temperature ...;
        store wind_results into 'wind.results';
        store air_results into 'air.results';
        store temp_results into 'temp.results';

    Currently Pig translates this script into three separate MR jobs which run sequentially: scan 'weather.txt', process the wind data, and store the wind results; scan 'weather.txt' again, process the air data, and store the air results; and so on.

    If the input dataset is large, this is not efficient.
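    For illustration only, here is a minimal sketch (not Pig's implementation) of what a shared scan amounts to in plain Hadoop MapReduce: a single map-only job reads 'weather.txt' once and routes each field to its own output via MultipleOutputs. The record layout (tab-separated wind/air/temperature fields) and the output names are assumptions made up for this example.

        // Sketch only: one map-only job that scans the input once and writes three outputs.
        // The tab-separated wind/air/temperature layout is an assumption for this example.
        import java.io.IOException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.NullWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
        import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

        public class SharedScan {

            public static class SplitMapper
                    extends Mapper<LongWritable, Text, NullWritable, Text> {

                private MultipleOutputs<NullWritable, Text> mos;

                @Override
                protected void setup(Context context) {
                    mos = new MultipleOutputs<>(context);
                }

                @Override
                protected void map(LongWritable key, Text value, Context context)
                        throws IOException, InterruptedException {
                    String[] parts = value.toString().split("\t", 3);
                    if (parts.length < 3) {
                        return; // skip malformed records
                    }
                    // Each record is read once but feeds all three result sets.
                    mos.write("wind", NullWritable.get(), new Text(parts[0]), "wind/part");
                    mos.write("air", NullWritable.get(), new Text(parts[1]), "air/part");
                    mos.write("temp", NullWritable.get(), new Text(parts[2]), "temp/part");
                }

                @Override
                protected void cleanup(Context context)
                        throws IOException, InterruptedException {
                    mos.close();
                }
            }

            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(new Configuration(), "shared-scan");
                job.setJarByClass(SharedScan.class);
                job.setMapperClass(SplitMapper.class);
                job.setNumReduceTasks(0); // map-only: a single pass over the input

                FileInputFormat.addInputPath(job, new Path("weather.txt"));
                FileOutputFormat.setOutputPath(job, new Path("shared.results"));

                MultipleOutputs.addNamedOutput(job, "wind",
                        TextOutputFormat.class, NullWritable.class, Text.class);
                MultipleOutputs.addNamedOutput(job, "air",
                        TextOutputFormat.class, NullWritable.class, Text.class);
                MultipleOutputs.addNamedOutput(job, "temp",
                        TextOutputFormat.class, NullWritable.class, Text.class);

                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }

    Something along these lines avoids rescanning the input for each result set; the feature request is for Pig to produce an equivalent shared-scan plan automatically when several STORE clauses read from the same LOAD.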

Attachments

Issue Links

Activity

People

    Assignee: Unassigned
    Reporter: Sijie Guo (hustlmsp)
    Votes: 0
    Watchers: 0

Dates

    Created:
    Updated:
    Resolved: