Details
- Sub-task
- Status: Closed
- Major
- Resolution: Fixed
- spark-branch
- None
Description
A regular cogroup operation requires a full map-reduce job. The goal of merge cogroup is to implement cogroup in a single stage (map only). This works only if the input data are guaranteed to be sorted.
Performance improves when a big dataset A is merge-cogrouped with a small dataset B: we first generate an index file for A, then load A (according to the index file) and B into memory to perform the cogroup. The gain comes from avoiding the reduce phase that a regular cogroup incurs.
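The single-stage idea can be illustrated with a minimal Python sketch. It is not the Pig or Spark implementation and it does not model the index file. It only shows why sorted inputs allow a cogroup in one pass with no shuffle or reduce. The function name `merge_cogroup` and the tuple layout (key in the first field) are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter


def merge_cogroup(a_sorted, b_sorted, key=itemgetter(0)):
    """Cogroup two inputs that are already sorted ascending by `key`.

    Yields (k, a_rows, b_rows) in key order. Each input is read once,
    front to back, so no shuffle/reduce step is needed.
    """
    a_groups = groupby(a_sorted, key=key)
    b_groups = groupby(b_sorted, key=key)
    a_cur = next(a_groups, None)
    b_cur = next(b_groups, None)
    while a_cur is not None or b_cur is not None:
        if b_cur is None or (a_cur is not None and a_cur[0] < b_cur[0]):
            # Key present only in A (or B is exhausted).
            yield a_cur[0], list(a_cur[1]), []
            a_cur = next(a_groups, None)
        elif a_cur is None or b_cur[0] < a_cur[0]:
            # Key present only in B (or A is exhausted).
            yield b_cur[0], [], list(b_cur[1])
            b_cur = next(b_groups, None)
        else:
            # Same key on both sides: emit both bags together.
            yield a_cur[0], list(a_cur[1]), list(b_cur[1])
            a_cur = next(a_groups, None)
            b_cur = next(b_groups, None)
```

Calling `list(merge_cogroup(A, B))` on two lists of tuples sorted by their first field returns one entry per distinct key, with an empty bag for whichever side lacks that key. If either input is not sorted, rows with the same key can end up in separate groups, which is why the sortedness guarantee matters.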
How to use
C = cogroup A by c1, B by c1 using 'merge';
Here A and B are both assumed to be sorted on the cogroup key c1.
Attachments
Issue Links
- links to