Pig / PIG-2423

document use case where co-group is better choice than join

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: documentation
    • Labels:
      None

      Description

      Optimization rules 2 and 3 suggested in https://issues.apache.org/jira/secure/attachment/12506841/pig_tpch.ppt (PIG-2397) recommend using co-group instead of join in certain cases. These should be documented on the Pig performance page.

        Issue Links

          Activity

          Julien Le Dem made changes -
          Fix Version/s 0.11 [ 12318878 ]
          Julien Le Dem added a comment -

          This will go in a future release

          Olga Natkovich added a comment -

          Thejas, should this be assigned to you? Is this going to go into 0.11 or 0.12?

          Daniel Dai made changes -
          Fix Version/s 0.11 [ 12318878 ]
          Fix Version/s 0.10.0 [ 12316246 ]
          Thejas M Nair added a comment -

          There is an existing jira for supporting combiner in co-group - PIG-1735.

          Another more general approach might be enriching the multi-query optimization such that the common partitioning between two jobs is recognized.

          We can discuss this on a new jira.

          Jie Li added a comment -

          Oh, I didn't notice that, and sure, we need to mention it.

          Just curious: is it possible to enable the combiner for co-group? I believe it's worth the effort, so that co-group can be used to full advantage over join in certain cases.

          Another more general approach might be enriching the multi-query optimization such that the common partitioning between two jobs is recognized.

          Shall we open new tickets to discuss them?

          Thejas M Nair added a comment -

          For case 1, query 1, which uses join, might perform better in some cases. This is because co-group does not currently use the combiner.
          In query 1, the combiner would run in the map task to reduce its output size. If the keys have only a few unique values, the map output would be very small, and the IO between map and reduce would drop drastically. The output of the first MR job would also be relatively small in such a case. The IO savings are then likely to exceed the cost of an extra MR job.

          So for case 1, I think it makes sense to add a clause - "Note that using co-group prevents the combiner from being used in the current version of Pig. So if the aggregation in query 1 will use the combiner (this depends on the UDF interface) and the output of the aggregation is going to be relatively small, the benefit of reduced IO from the combiner is likely to outweigh the cost of the additional MR job".
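
To make the combiner argument concrete, here is a minimal Python sketch (not Pig code, and not how Pig implements it) that models a single map task feeding a per-key count. The sample split and the function names are made up for illustration; the only point is how many records leave the map task with and without map-side pre-aggregation.

```python
from collections import Counter

def map_output_without_combiner(records):
    # Without a combiner, one (key, 1) pair per input record
    # is written out and shuffled to the reducers.
    return [(key, 1) for key, _ in records]

def map_output_with_combiner(records):
    # With a combiner, counts are pre-aggregated inside the map task,
    # so only one (key, partial_count) pair per distinct key is shuffled.
    return list(Counter(key for key, _ in records).items())

# Hypothetical input split: 10,000 records but only 3 distinct keys.
split = [(i % 3, i) for i in range(10_000)]

plain = map_output_without_combiner(split)
combined = map_output_with_combiner(split)

print("records shuffled without combiner:", len(plain))
print("records shuffled with combiner:   ", len(combined))
```

With few distinct keys the shuffled data shrinks to a handful of records, which is the situation where the two-job join plan could plausibly beat a single co-group job.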

          Jie Li added a comment -

          Thanks Thejas. For now I'll just paste the text here. I've added two cases, and I'm wondering whether they can be made more general. Feel free to improve them.

          1. Use COGROUP to do the join
          
          When there are GROUP-BY and JOIN on the same keys, we can usually combine them using COGROUP to reduce the number of MapReduce jobs. 
          
          -- Query 1
          A = load 'myfile' as (x, u, v);
          B = load 'myotherfile' as (x, y, z);
          
          t1 = group B by x;
          t2 = foreach t1 generate group as x, COUNT(B.y) as count_y;
          t3 = join A by x, t2 by x;
          
          -- Query 2
          A = load 'myfile' as (x, u, v);
          B = load 'myotherfile' as (x, y, z);
          
          t1 = cogroup A by x, B by x;
          t2 = filter t1 by NOT IsEmpty(A) AND NOT IsEmpty(B); -- an inner join
          t3 = foreach t2 generate group, COUNT(B.y);
          
          While Query 1 requires two separate MR jobs, Query 2 requires only one MR job by using COGROUP.
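
As a rough check of the idea in case 1, the following Python sketch models both plans on a few made-up rows: one function does the count and the join as two separate passes, the other buckets both inputs by key once, the way COGROUP does. The sample data and function names are hypothetical, and nulls are ignored. Note that Query 2 as pasted emits only the key and the count; the sketch also carries A's fields through (the equivalent of adding FLATTEN(A)) so that the two results can be compared row by row.

```python
from collections import defaultdict

# Hypothetical sample rows: A is (x, u, v), B is (x, y, z).
A = [(1, "a", "b"), (1, "c", "d"), (2, "e", "f"), (4, "g", "h")]
B = [(1, 10, 0), (1, 11, 0), (2, 12, 0), (3, 13, 0)]

def group_then_join(a_rows, b_rows):
    # Pass 1 (first MR job): GROUP B BY x, then COUNT(B.y) per key.
    counts = defaultdict(int)
    for x, _y, _z in b_rows:
        counts[x] += 1
    # Pass 2 (second MR job): inner JOIN of A with the per-key counts.
    return sorted((x, u, v, counts[x]) for x, u, v in a_rows if x in counts)

def cogroup_once(a_rows, b_rows):
    # Single pass: bucket both inputs by the same key, like COGROUP.
    groups = defaultdict(lambda: ([], []))
    for row in a_rows:
        groups[row[0]][0].append(row)
    for row in b_rows:
        groups[row[0]][1].append(row)
    out = []
    for a_bag, b_bag in groups.values():
        # NOT IsEmpty(A) AND NOT IsEmpty(B): keep keys present on both sides.
        if a_bag and b_bag:
            out.extend((x, u, v, len(b_bag)) for x, u, v in a_bag)
    return sorted(out)

print(group_then_join(A, B))
print(cogroup_once(A, B))
```

Keys 3 and 4 appear on only one side, so both plans drop them, just as an inner join would.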
          
          2. Use GROUP+FLATTEN to do the self join
          
          Sometimes we need a self join to get some additional information. For example, for each employee, find the average salary in his/her department.
          
          -- Query 1
          A = load 'myfile' as (name, salary, department);
          t1 = group A by department;
          t2 = foreach t1 generate group, AVG(A.salary) as avg_salary;
          t3 = join A by department, t2 by group;
          
          -- Query 2
          A = load 'myfile' as (name, salary, department);
          t1 = group A by department;
          t2 = foreach t1 generate FLATTEN(A),  AVG(A.salary) as avg_salary;
          
          While Query 1 needs two MR jobs, Query 2 requires only one MR job by using FLATTEN after GROUP to implement the self join.
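
The self-join case can be modelled the same way. The Python sketch below is an illustration on made-up employee rows, with hypothetical function names: it computes the department average once per group and then re-emits every member of the group with that average attached, which is what GROUP followed by FLATTEN(A) amounts to. A two-pass version is included only to compare the results.

```python
from collections import defaultdict

# Hypothetical rows: (name, salary, department).
employees = [
    ("ann", 100.0, "eng"),
    ("bob", 200.0, "eng"),
    ("cat", 90.0, "ops"),
]

def group_then_self_join(rows):
    # Pass 1: average salary per department.
    by_dept = defaultdict(list)
    for _name, salary, dept in rows:
        by_dept[dept].append(salary)
    averages = {dept: sum(s) / len(s) for dept, s in by_dept.items()}
    # Pass 2: join every employee row back to its department average.
    return sorted(row + (averages[row[2]],) for row in rows)

def group_flatten(rows):
    # Single pass: group by department, then flatten each bag
    # while attaching the average computed over that same bag.
    bags = defaultdict(list)
    for row in rows:
        bags[row[2]].append(row)
    out = []
    for bag in bags.values():
        avg = sum(r[1] for r in bag) / len(bag)
        out.extend(r + (avg,) for r in bag)
    return sorted(out)

print(group_then_self_join(employees))
print(group_flatten(employees))
```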
          
          Thejas M Nair added a comment -

          You can just describe the documentation changes that are required. Corinne will incorporate them into the XML and also take care of the formatting.

          But if you really want to try out Forrest, you can make changes to the XML and then use the 'ant docs' target to generate the HTML docs. You will need to download Apache Forrest and set forrest.home to it.

          Jie Li added a comment -

          Hi Daniel, since it's an XML file, how can I see the HTML version to make sure the result looks OK? Or do I not need to care about the format?

          Daniel Dai added a comment -

          Hi, Jie
          The document source is src/docs/src/documentation/content/xdocs/perf.xml. But the best way is to describe the diff in the Jira, we will ask Corinne to change the document source.

          Jie Li added a comment -

          Sorry for the delay; I finally found some time to work on this. How do I edit the Pig performance page?

          Thejas M Nair made changes -
          Field Original Value New Value
          Link This issue is related to PIG-2397 [ PIG-2397 ]
          Thejas M Nair created issue -

            People

            • Assignee:
              Unassigned
              Reporter:
              Thejas M Nair
            • Votes:
              0
              Watchers:
              3
