  1. Pig
  2. PIG-4059 Pig on Spark
  3. PIG-4601

Implement Merge CoGroup for Spark engine

    Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: spark-branch
    • Fix Version/s: spark-branch
    • Component/s: spark
    • Labels:
      None

      Description

      A regular cogroup operation requires a full map-reduce job. The goal of merge cogroup is to implement cogroup in a single map stage, with no reduce. For this to work, the input data must already be sorted.

      This improves performance when a big dataset A is merge-cogrouped with a small dataset B: we first generate an index file for A, then load A according to the index file, together with B, into memory to do the cogroup. The gain comes from avoiding the cost of the reduce phase that a regular cogroup incurs.
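      To illustrate why sorted inputs make a reduce stage unnecessary, here is a minimal Python sketch of a cogroup done in one forward pass over two key-sorted inputs. It is not the code from the attached patches: the function name `merge_cogroup` and the tuple-based row layout are hypothetical, and the index file for A and all Spark specifics are left out.

```python
from itertools import groupby
from operator import itemgetter


def merge_cogroup(a_rows, b_rows, key=itemgetter(0)):
    """Cogroup two inputs that are already sorted by `key`.

    Hypothetical helper for illustration only. Yields one
    (key, bag_of_a_rows, bag_of_b_rows) triple per distinct key,
    reading each input exactly once and never shuffling data.
    """
    a_groups = groupby(a_rows, key=key)
    b_groups = groupby(b_rows, key=key)
    a_cur = next(a_groups, None)
    b_cur = next(b_groups, None)

    while a_cur is not None or b_cur is not None:
        if b_cur is None or (a_cur is not None and a_cur[0] < b_cur[0]):
            # Key exists only in A: the bag for B is empty.
            yield a_cur[0], list(a_cur[1]), []
            a_cur = next(a_groups, None)
        elif a_cur is None or b_cur[0] < a_cur[0]:
            # Key exists only in B: the bag for A is empty.
            yield b_cur[0], [], list(b_cur[1])
            b_cur = next(b_groups, None)
        else:
            # Same key on both sides: emit both bags together.
            yield a_cur[0], list(a_cur[1]), list(b_cur[1])
            a_cur = next(a_groups, None)
            b_cur = next(b_groups, None)
```

      Because both inputs arrive in key order, every row for a given key is adjacent in each input, so the bags can be built by advancing two cursors. If either input were unsorted, rows for the same key could appear in separate groups, which is why the sort guarantee is required.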

      How to use

      C = cogroup A by c1, B by c1 using 'merge';
      

      Here A and B must both already be sorted on the cogroup key (c1).
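      As a rough illustration of the sortedness precondition, the Python sketch below checks that a list of rows is in non-decreasing order of its cogroup key. The helper name `is_sorted_by` and the tuple row layout are hypothetical; nothing in this issue says Pig performs such a check itself.

```python
from operator import itemgetter


def is_sorted_by(rows, key=itemgetter(0)):
    """Return True if `rows` is in non-decreasing order of `key`.

    Hypothetical helper for illustration only.
    """
    keys = [key(row) for row in rows]
    # Compare each key with its successor; an empty or
    # single-row input is trivially sorted.
    return all(x <= y for x, y in zip(keys, keys[1:]))
```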

        Attachments

        1. PIG-4601_1.patch
          26 kB
          liyunzhang
        2. PIG-4601_2.patch
          26 kB
          liyunzhang
        3. PIG-4601_3.patch
          23 kB
          liyunzhang
        4. PIG-4601_4.patch
          26 kB
          liyunzhang

              People

              • Assignee:
                kellyzly liyunzhang
                Reporter:
                mohitsabharwal Mohit Sabharwal
              • Votes:
                0
                Watchers:
                4
