  1. Pig
  2. PIG-4420

Support for map side cross similar to replicate join


Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      Our CROSS implementation is very costly. We recently had a case where a user was doing a CROSS of 30 million records against 3K records, and it caused a lot of disk error exceptions during the shuffle phase. We need to add support for a map-side cross, with syntax similar to replicated join:

      C = CROSS A, B using 'replicate';

      The smaller table can be loaded into an in-memory list (replicated join uses a hashmap for this) and iterated over for each record in the bigger table. Because the bigger table no longer has to go through the shuffle, this should give a major performance boost and drastically reduce resource usage.
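
      To make the idea concrete, here is a minimal sketch of the per-mapper logic in Python. It is not Pig code and not a proposed implementation; the function name map_side_cross and the tuple shapes are hypothetical, and it assumes the smaller input fits in memory, as is already required for replicated join.

```python
from typing import Iterable, Iterator, Tuple, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def map_side_cross(big: Iterable[A], small: Iterable[B]) -> Iterator[Tuple[A, B]]:
    """Yield every (big, small) pair while streaming the big input once."""
    # Assumption: the small input fits in memory on every mapper,
    # the same constraint a replicated join places on its small side.
    replicated = list(small)
    # The big input is only streamed, never buffered or shuffled;
    # each of its records is paired with every cached small record.
    for big_record in big:
        for small_record in replicated:
            yield (big_record, small_record)


if __name__ == "__main__":
    a = [("a1",), ("a2",), ("a3",)]  # stands in for the 30 million record side
    b = [("b1",), ("b2",)]           # stands in for the 3K record side
    for pair in map_side_cross(a, b):
        print(pair)
```

      A plain list is enough here because CROSS has no join key to look up, so every cached record is visited for every record of the bigger table.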


          People

            Assignee: Unassigned
            Reporter: Rohini Palaniswamy
