  PIG-4601 (sub-task of PIG-4059 Pig on Spark)

Implement Merge CoGroup for Spark engine

Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: spark-branch
    • Fix Version/s: spark-branch
    • Component/s: spark
    • Labels: None

    Description

      A regular cogroup operation requires a full map-reduce job. The goal of merge cogroup is to implement cogroup in a single map-only stage. To make this possible, the input data must already be sorted.
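
The single-pass idea can be illustrated with a minimal Python sketch. It is not the Pig or Spark implementation: the function name `merge_cogroup` and the tuple layout (key in the first field) are assumptions made for illustration only. It shows why sorted inputs are enough to cogroup with one forward scan and no shuffle.

```python
from itertools import groupby
from operator import itemgetter


def merge_cogroup(a_rows, b_rows, key=itemgetter(0)):
    """Cogroup two key-sorted iterables in one forward pass (no shuffle).

    Yields (key, bag_of_a_rows, bag_of_b_rows). A side with no rows for
    a key contributes an empty bag.
    Hypothetical helper for illustration; not part of Pig.
    """
    a_groups = groupby(a_rows, key=key)
    b_groups = groupby(b_rows, key=key)
    a_cur = next(a_groups, None)
    b_cur = next(b_groups, None)
    while a_cur is not None or b_cur is not None:
        if b_cur is None or (a_cur is not None and a_cur[0] < b_cur[0]):
            # Key appears only in A (or B is exhausted).
            yield a_cur[0], list(a_cur[1]), []
            a_cur = next(a_groups, None)
        elif a_cur is None or b_cur[0] < a_cur[0]:
            # Key appears only in B (or A is exhausted).
            yield b_cur[0], [], list(b_cur[1])
            b_cur = next(b_groups, None)
        else:
            # Same key on both sides: emit both bags together.
            yield a_cur[0], list(a_cur[1]), list(b_cur[1])
            a_cur = next(a_groups, None)
            b_cur = next(b_groups, None)
```

If either input is not sorted on the key, the same key can show up in more than one output group, which is why the sortedness guarantee matters.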

      Merge cogroup improves performance when A (a big dataset) is merge-cogrouped with B (a small dataset): we first generate an index file for A, then load A according to the index file, together with B, into memory to do the cogroup. Performance improves because, compared with a regular cogroup, there is no reduce phase to pay for.
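
The issue does not describe the layout of the index file. As a rough sketch only, assuming a sparse index that records the first key of every fixed-size chunk of the sorted data, a lookup could work as follows. The names `build_sparse_index` and `seek_offset` and the chunking scheme are hypothetical and are not taken from the patch.

```python
from bisect import bisect_left


def build_sparse_index(sorted_rows, chunk_size, key=lambda row: row[0]):
    """Record (first_key, offset) for each chunk of key-sorted rows.

    Assumed index layout for illustration; not the format used by Pig.
    """
    return [
        (key(sorted_rows[i]), i)
        for i in range(0, len(sorted_rows), chunk_size)
    ]


def seek_offset(index, wanted_key):
    """Return an offset at or before the first row with key >= wanted_key."""
    if not index:
        return 0
    first_keys = [k for k, _ in index]
    # Step back one chunk: rows for wanted_key may begin before the first
    # chunk whose first key equals it, so start scanning slightly early.
    pos = max(bisect_left(first_keys, wanted_key) - 1, 0)
    return index[pos][1]
```

A reader would then scan forward from the returned offset, skipping the few rows whose key is still smaller than the wanted key, instead of reading the sorted data from the beginning.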

      How to use

      C = cogroup A by c1, B by c1 using 'merge';
      

      Here A and B must both be sorted on the cogroup key c1.

      Attachments

        1. PIG-4601_4.patch
          26 kB
          liyunzhang
        2. PIG-4601_3.patch
          23 kB
          liyunzhang
        3. PIG-4601_2.patch
          26 kB
          liyunzhang
        4. PIG-4601_1.patch
          26 kB
          liyunzhang

            People

              kellyzly liyunzhang
              mohitsabharwal Mohit Sabharwal
              Votes: 0
              Watchers: 4
